WO2019207910A1 - Data analysis system and data analysis method - Google Patents

Data analysis system and data analysis method Download PDF

Info

Publication number
WO2019207910A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
node
score
analysis system
data analysis
Prior art date
Application number
PCT/JP2019/005167
Other languages
French (fr)
Japanese (ja)
Inventor
前川 拓也
Original Assignee
株式会社日立ソリューションズ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立ソリューションズ (Hitachi Solutions, Ltd.)
Publication of WO2019207910A1 publication Critical patent/WO2019207910A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This disclosure relates to a data analysis system.
  • Machine learning technologies such as neural networks are attracting attention. Attempts have been made to solve various problems using a machine learning model obtained by machine learning.
  • For example, Japanese Patent Application Laid-Open No. 2003-323601 describes a prediction device comprising: a similar case extraction unit 1 that, given a known case set and a prediction case, extracts from the known case set a similar case set, i.e., a set of cases similar to the prediction case; a certainty factor calculation unit 2 that calculates the certainty factor of a prediction attribute value from the similar case set; and a reliability measure calculation unit 3 that calculates a reliability measure of that certainty factor from the similar case set and the certainty factor. The device outputs the certainty factor of a prediction attribute value together with its reliability measure.
  • However, the method described in Japanese Patent Application Laid-Open No. 2003-323601 merely supports the user's subsequent judgment on a prediction result by attaching a reliability measure, which indicates the reliability of the certainty factor, to the certainty factor of a prediction based on similar cases.
  • The user cannot know the degree of contribution of each explanatory variable to the prediction result. That is, the user cannot know what factors in the input data led to the prediction result. In other words, the user has been using the machine learning model while the relationship between the explanatory variables and the objective variable (the prediction result) of the neural network remains unknown. For this reason, it is difficult for the user to know what kind of judgment to make based on the prediction result.
  • The present invention has been made in view of this situation, and provides a technique for visualizing the degree of influence of each explanatory variable on the objective variable, making it possible to grasp what judgment should be made based on the prediction result.
  • A representative example of the invention is as follows: a data analysis system comprising an arithmetic device that executes a program and a storage device connected to the arithmetic device, wherein the arithmetic device provides a feature node calculation unit that divides an input data set consisting of a plurality of explanatory variables used by a machine learning model during learning, or an input data set consisting of data obtained by processing those explanatory variables, according to a specified division condition, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between each explanatory variable and the objective variable; and an output processing unit that outputs an output result including the score.
  • In this embodiment, the machine learning model has been trained in advance; the system refers to the learning data used in that training and obtains output results using the trained machine learning model. The machine learning model returns a k-dimensional output vector in response to a d-dimensional input vector, and the description assumes that the output signal of the machine learning model in this embodiment corresponds to the classification probabilities of k classification classes.
  • FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment.
  • the data analysis system of this embodiment is a computer that analyzes the relationship between input data and output data in machine learning, and includes an input device 101, an output device 102, a display device 103, a processing device 104, and a storage device 111.
  • the input device 101 is a keyboard, a mouse, or the like, and is an interface that receives input from the user.
  • the output device 102 is a printer or the like, and is an interface that outputs the execution result of the program in a format that can be visually recognized by the user.
  • the display device 103 is a display device such as a liquid crystal display device, and is an interface that outputs the execution result of the program in a format that can be visually recognized by the user.
  • a terminal connected to the data analysis system via a network may provide the input device 101, the output device 102, and the display device 103.
  • the processing device 104 includes a processor (arithmetic device) that executes a program and a memory that stores the program and data. Specifically, the input processing unit 106, the feature node calculation unit 107, the score calculation unit 108, the node mapping unit 109, and the output processing unit 110 are realized by the processor executing the program. Note that a part of processing performed by the processor executing the program may be executed by another arithmetic device (for example, FPGA).
  • the memory includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element.
  • the ROM stores an immutable program (for example, BIOS).
  • The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the program executed by the processor and the data used when the program is executed.
  • the storage device 111 is a large-capacity non-volatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD).
  • the storage device 111 stores data used by the processing device 104 when executing the program and a program executed by the processing device 104.
  • Specifically, the storage device 111 stores the data necessary for the series of processes and the output results, including an input data table 112, a normalization information table 113, a division condition table 114, a node information table 115, a node distance table 116, a score table 117, and a weighted average score table 118.
  • the program is read from the storage device 111, loaded into the memory, and executed by the processor.
  • the data analysis system may have a communication interface that controls communication with other devices according to a predetermined protocol.
  • the program executed by the processing device 104 is provided to the data analysis system via a removable medium (CD-ROM, flash memory, etc.) or a network, and is stored in the nonvolatile storage device 111 that is a non-temporary storage medium. For this reason, the data analysis system may have an interface for reading data from a removable medium.
  • The data analysis system is a computer system configured on one physical computer or on a plurality of logically or physically configured computers, and may operate on a virtual machine built on a plurality of physical computer resources.
  • FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment.
  • the input data table 112 stores data obtained by processing the learning data of the machine learning model into a format used in a series of processes by the data analysis system of the present embodiment, and includes the explanatory variables 1 to d (201) and the objective variables 1 to k (202) is included.
  • the explanatory variables 1 to d (201) represent d-dimensional vectors that are input data of the machine learning model.
  • In machine learning, however, data is often normalized for each variable. In this embodiment, the normalized data is converted back to the original numerical values using the normalization information table 113 before being stored.
  • When the learning data of the machine learning model is a time series, a variable x can be flattened into per-time-point variables named x_t0, x_t1, x_t2, and so on, as sketched below. In this case, the number of dimensions of the explanatory variables 201 does not match the number of input dimensions of the machine learning model, and the data format is converted each time data in the input format of this embodiment is input to the machine learning model.
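  • As a rough illustration (not the patent's code), the flattening and the retained index correspondence might look like the following Python sketch; the DataFrame layout and column names are assumptions:

```python
import pandas as pd

def flatten_time_series(df, var_name):
    """Flatten time-series values of `var_name` into x_t0, x_t1, ... columns.

    Assumes `df` holds one row per (entity, time) pair with columns
    'entity', 't' and `var_name`; these names are hypothetical.
    """
    wide = df.pivot(index="entity", columns="t", values=var_name)
    wide.columns = [f"{var_name}_t{t}" for t in wide.columns]
    # keep the pre-/post-expansion correspondence, as stored in the
    # model data format correspondence information 208
    correspondence = {var_name: list(wide.columns)}
    return wide, correspondence
```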
  • the objective variables 1 to k (202) are k-dimensional vectors that are output results of the machine learning model.
  • The normalization information table 113 stores information on the normalization processing performed when the machine learning model was trained, and contains the data of a variable ID 203, a variable name 204, a data type 205, an average 206, a standard deviation 207, and model data format correspondence information 208.
  • the variable ID 203 is an index that identifies the element of the explanatory variable 201.
  • the variable name 204 is the name of the explanatory variable.
  • the data type 205 is the data type of the explanatory variable (for example, logical type, integer type, floating point type, etc.).
  • The average 206 and the standard deviation 207 store the average and the standard deviation used in the normalization processing when the machine learning model was trained. For variables not subjected to normalization, such as logical variables, the average may be set to 0 and the standard deviation to 1.
  • The model data format correspondence information 208 stores information for mutual conversion when the input format of the machine learning model differs from the input format handled by the data analysis system of this embodiment. For example, for data including a time series, the variable x is expanded into x_t0, x_t1, and so on, so describing the correspondence between the pre-expansion and post-expansion indexes enables mutual conversion.
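  • A minimal sketch of this normalization and its inverse, assuming simple z-score normalization with the average 206 and standard deviation 207:

```python
import numpy as np

def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # z-score normalization, as used when the model was trained
    return (x - mean) / std

def denormalize(z: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # inverse transform, used to store original values in the input data table
    return z * std + mean

# for variables without normalization (e.g. logical type), mean=0 and std=1
# make both functions the identity, matching the table defaults above
```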
  • The division condition table 114 stores the conditions by which the feature node calculation unit 107 divides the input data table 112, and contains the data of a condition ID 209, a division condition 210, a number of data 211, a map size 212, and aggregation flags 1 to k (213).
  • the condition ID 209 is identification information for identifying a condition recorded in the division condition table 114.
  • the division condition 210 is a condition for dividing the input data to obtain one data set. For example, a character string such as an SQL select statement may be used.
  • the division condition 210 may describe a specific value or range for the explanatory variable, or a combination of values for the objective variable.
  • the number of data 211 is the number of data in the input data selected according to the division condition.
  • The map size 212 stores the map size used by the feature node calculation unit 107 in the node vector calculation step 403 of FIG. 4.
  • When the map size is set automatically, the value of the map size 212 may initially be NULL or the like, with the result of the automatic setting stored afterwards.
  • The aggregation flags 1 to k (213) are an array of k flags for the objective variables, used in the objective score aggregation process 504 of FIG. 5. Only the scores of the objective variables whose flag is set to 1 are aggregated into the objective score. For example, in an analysis aiming at rank-up from the current rank in a member management system, each current rank is set as a division condition, and the flags of the objective variables corresponding to predicted ranks higher than the current rank are set to 1.
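  • As an illustrative sketch (assuming the input data table is a pandas DataFrame and the division conditions are pandas query strings, standing in for the SQL-like strings described above):

```python
import numpy as np
import pandas as pd

def divide(input_df: pd.DataFrame, division_condition: str) -> pd.DataFrame:
    # select the rows matching one division condition 210,
    # e.g. "current_rank == 'silver'" (a hypothetical condition)
    return input_df.query(division_condition)

def objective_score(scores: np.ndarray, flags: np.ndarray) -> np.ndarray:
    # aggregate the scores of the objective variables whose flag 213 is 1;
    # scores: (k, d) per objective/explanatory variable, flags: (k,)
    return scores[flags == 1].sum(axis=0)
```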
  • The node information table 115 stores the feature node calculation results of the feature node calculation unit 107, and contains the data of a condition ID 214, a node ID 215, a hit number 216, a hit rate 217, coordinates 218, explanatory variables 1 to d (219), and objective variables 1 to k (220).
  • the condition ID 214 is identification information (condition ID 209) for identifying a condition recorded in the division condition table 114.
  • the node ID 215 is node identification information that satisfies the condition specified by the condition ID 214.
  • The hit number 216 is, for the node identified by the node ID 215, the number of data items in the data set divided by the division condition that are closer to that node than to any other node.
  • the hit rate 217 is a value obtained by dividing the hit number 216 by the data number 211.
  • the coordinates 218 are the processing result of the node mapping process 304 shown in FIG.
  • The explanatory variables 1 to d (219) and the objective variables 1 to k (220) are data in the same format as the input data table 112, and form the node vectors representing the features of the distribution structure of the input data set. These vectors need not match any data item in the input data table and need not follow the type specified in the data type 205; for example, even if a logical or integer type is specified, the values can be stored as floating point data.
  • The node distance table 116 stores, for the explanatory variables 219 of the node information table 115 (or for node vectors formed by appending the objective variables 220 to the explanatory variables 219), the distance between each pair of nodes, and contains the data of a node from 221, a node to 222, and a distance 223.
  • the node from 221 and the node to 222 are identification information for specifying the nodes recorded in the node information table 115, respectively.
  • the values of the node from 221 and the node to 222 may be a set of the condition ID 214 and the node ID 215, or may be an index on the node information table 115.
  • a distance 223 is a distance of a node vector between the node from 221 and the node to 222.
  • the node distance table 116 may be expressed as a two-dimensional array.
  • In this case, the index of the node information table 115 is used for the rows and columns.
  • the score table 117 stores the calculation result of the score calculation unit 108, and includes data of the objective variable ID 224, the condition ID 225, the node ID 226, and the score 227 of the explanatory variables 1 to d.
  • the objective variable ID 224 stores the element number of the k-dimensional vector in the output result of the machine learning model.
  • the condition ID 225 and the node ID 226 are identification information for specifying the node recorded in the node information table 115, and use values common to the condition ID 214 and the node ID 215 of the node information table 115.
  • the score 227 of the explanatory variables 1 to d is a calculation result of the score calculation unit 108, and is stored for each objective variable ID 224, condition ID 225, node ID 226, and explanatory variable.
  • The score table 117 also stores the objective score and the weighted average score described with reference to FIG. 5. The objective score is recorded with the objective variable ID 224 set to -1 or the like, and the weighted average score is recorded with both the objective variable ID 224 and the node ID 226 set to -1 or the like, indicating that no specific objective variable or node is identified.
  • The weighted average score table 118 stores, for each division condition, the score of each explanatory variable. Specifically, it partitions the weighted average scores calculated in step 505 of the score calculation process 303 (FIG. 5, described later) by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names. The weighted average score table 118 is generated in step 701 of the output process 305 (FIG. 7).
  • With the weighted average score table 118, the user can easily grasp, for each target segment represented by a division condition, the explanatory variables with a high degree of influence, and can compare the rank and score of an explanatory variable across division conditions. For example, the influence of attribute A may be large under conditions 1 and 2, while the influence of attribute J is large under conditions 3 and 4. If the score of attribute I has opposite signs under two conditions, applying the same measure may produce opposite effects. The table can thus be used for planning measures for the target segment indicated by each condition.
  • FIG. 3 is a flowchart of the overall processing of this embodiment.
  • the input processing unit 106 executes input processing (301).
  • The input processing unit 106 refers to the normalization information table 113, converts the learning data of the machine learning model from the model's input format to the input format of this embodiment, returns the normalized numerical values to their original values, and stores the result in the input data table 112.
  • the feature node calculation unit 107 executes feature node calculation processing (302). For example, the feature node calculation unit 107 divides the input data table 112 according to the division condition table 114, calculates a feature node from each divided data set, and stores the result in the node information table. Details of the feature node calculation process will be described with reference to FIG.
  • the score calculation unit 108 executes a score calculation process (303). For example, the score calculation unit 108 calculates a score representing the degree of influence of the explanatory variable, and stores the result in the score table. Details of the score calculation process will be described with reference to FIG.
  • the node mapping unit 109 executes a node mapping process (304). For example, the node mapping unit 109 maps the feature node obtained in step 302 to the low-dimensional space. Details of the node mapping process will be described with reference to FIG.
  • the output processing unit 110 executes the output process (305) and ends the process. Details of the output process will be described with reference to FIG.
  • FIG. 4 is a flowchart of the feature node calculation process 302 of this embodiment.
  • the feature node calculation unit 107 loops the variable p from 1 to the number of data in the division condition table 114 (401). Thereafter, the processing from step 402 to step 405 is executed for the p-th division condition.
  • The feature node calculation unit 107 performs the data division process (402). For example, data satisfying the division condition 210 of the p-th division condition is selected from the input data table 112, and the selected data set is normalized using the normalization information table 113.
  • The feature node calculation unit 107 calculates node vectors (403). For example, node vectors representing the features of the selected data set are calculated with a smaller number of nodes, taking the distribution structure into account, using a clustering method such as the k-means method.
  • In this embodiment, a self-organizing map (hereinafter abbreviated as SOM) is applied.
  • A SOM is a kind of neural network expressed by nodes arranged in a grid and by edges connecting adjacent nodes. Each node is assigned a reference vector in the same format as the input data.
  • In SOM learning, the reference vector of the node closest to each learning data item (hereinafter referred to as the BMU, for Best Matching Unit) and the reference vectors of the nodes connected to the BMU are updated so as to approximate the learning data. By repeating this process, the complicated distribution structure of the learning data can be mapped onto the geometric structure of the nodes. Since SOM is a known technique, a detailed description is omitted.
  • The reference vector of each node calculated as a result of the SOM is stored in the node information table 115 as the explanatory variables 219 and the objective variables 220.
  • The format of the learning data used when executing the SOM can be either the explanatory variables alone or the combination of the explanatory variables and the objective variables; which format to use is preferably set in advance. The explanatory variables 219 and the objective variables 220 stored as the output result follow the input format of this learning data.
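  • A minimal numpy sketch of such SOM training (an assumed illustration, not the patent's implementation; the grid size, learning rate, and neighborhood schedule are arbitrary choices):

```python
import numpy as np

def train_som(data, rows, cols, iters=1000, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM on `data` (n_samples x dim); return the
    reference vectors (node vectors) with shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # node positions on the grid, used by the neighborhood function
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    # initialize reference vectors from random data samples
    w = data[rng.integers(0, len(data), rows * cols)].astype(float)
    for t in range(iters):
        x = data[rng.integers(0, len(data))]
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))  # Best Matching Unit
        # pull the BMU and its grid neighbors toward the data item
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)
    return w
```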
  • the feature node calculation unit 107 counts the number of hits for each node (404).
  • For each node, the number of data items in the selected data set whose BMU is that node is calculated as the value of the hit number 216.
  • The hit rate 217 is calculated by dividing this by the number of selected data items.
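  • A sketch of this counting, under the same assumptions as the SOM sketch above:

```python
import numpy as np

def hit_counts(data, node_vectors):
    """Return the hit number 216 and hit rate 217 for each node."""
    # BMU for every data item: index of the nearest node vector
    d2 = ((data[:, None, :] - node_vectors[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)
    hits = np.bincount(bmu, minlength=len(node_vectors))
    return hits, hits / len(data)  # divide by the number of selected items
```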
  • The feature node calculation unit 107 stores the calculated result in the data storage area (405). At this time, since the node vectors calculated in step 403 are normalized, the normalization information table 113 is used to restore the original values before the result is stored.
  • When the loop from step 401 to step 405 ends, the feature node calculation process 302 ends.
  • FIG. 5 is a flowchart of the score calculation process 303 of this embodiment.
  • the score calculation unit 108 loops the variable i from 1 to the number of data items in the node information table 115 (501). Thereafter, the processing from step 502 to step 504 is executed for the i-th node.
  • the score calculation unit 108 generates a neighborhood data set of the node i and a prediction result of the machine learning model corresponding thereto (502).
  • the neighborhood data is vector data located around the d-dimensional vector represented by the explanatory variable of the node specified by the variable i.
  • As the method for generating the neighborhood data, this embodiment uses random numbers following a normal distribution whose mean is the explanatory variable value of node i and whose standard deviation is half the standard deviation recorded in the normalization information.
  • The number of data items in the neighborhood data set may be designated in advance. Prediction with the machine learning model can be executed by performing normalization using the normalization information table and conversion using the model data format correspondence information 208.
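  • A sketch of this generation step (array shapes are assumptions; `predict` stands in for the trained machine learning model, with normalization and format conversion assumed to happen inside it):

```python
import numpy as np

def neighborhood_data(node_x, std, n_samples, predict, rng=None):
    """Generate neighborhood data of a feature node and its predictions.

    node_x: (d,) explanatory variables of node i (denormalized);
    std: (d,) standard deviation 207 from the normalization information;
    predict: callable mapping (n, d) inputs to (n, k) class probabilities.
    """
    if rng is None:
        rng = np.random.default_rng()
    # normal distribution centered on the node, with half the recorded std
    X = rng.normal(loc=node_x, scale=std / 2.0, size=(n_samples, len(node_x)))
    Y = predict(X)
    return X, Y
```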
  • the score calculation unit 108 performs local model estimation processing on the generated neighborhood data set and the prediction result of the machine learning model (503).
  • In step 503, a score representing the relationship between each explanatory variable and the objective variable is obtained for the neighborhood data. Specifically, a linear model of the form Y = S1·x1 + ... + Sd·xd + C is estimated from the neighborhood data set and the corresponding predictions, where the output Y, each score Si, and the intercept C are k-dimensional vectors and xi is the i-th explanatory variable. Since linear model estimation is a known technique, a detailed description is omitted.
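  • A sketch of the local model estimation using ordinary least squares (scikit-learn is an assumed choice here; the text only states that linear model estimation is a known technique):

```python
from sklearn.linear_model import LinearRegression

def local_scores(X, Y):
    """Fit a local linear model Y ≈ X @ S.T + C on the neighborhood data.

    X: (n, d) neighborhood explanatory variables; Y: (n, k) model outputs.
    Returns S with shape (k, d): the score 227 of each explanatory variable
    for each objective variable, i.e. column i of S is the k-dim Si.
    """
    reg = LinearRegression().fit(X, Y)
    return reg.coef_  # (k, d) coefficient matrix
```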
  • The score calculation unit 108 calculates the objective score by aggregating the scores obtained in step 503 according to the aggregation flags 213 (504). Specifically, the scores of the elements whose flag is 1 are summed for each explanatory variable.
  • The score calculation unit 108 calculates the weighted average score by weighting the objective score by the hit rate 217 (505).
  • the weighted average score is calculated for each explanatory variable for all nodes having the same condition ID.
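  • A sketch combining steps 504 and 505 for one condition ID (array shapes are assumptions):

```python
import numpy as np

def weighted_average_score(objective_scores, hit_rates):
    """Hit-rate-weighted average of the objective scores (step 505).

    objective_scores: (n_nodes, d) objective score per node from step 504,
    for all nodes sharing one condition ID; hit_rates: (n_nodes,) hit
    rate 217 of each node. Returns a (d,) score per explanatory variable.
    """
    weighted = hit_rates[:, None] * objective_scores
    return weighted.sum(axis=0) / hit_rates.sum()
```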
  • the score calculation unit 108 stores the calculated result in the data storage area (506), and ends the score calculation process.
  • FIG. 6 is a flowchart of the node mapping process 304 of this embodiment.
  • In this embodiment, a multidimensional scaling method (hereinafter abbreviated as MDS) is used.
  • MDS is a technique for mapping nodes in a multidimensional vector space to a low-dimensional space such as 2D or 3D, performing the mapping so that the distances between nodes are reproduced as faithfully as possible. Since MDS is a known technique, a detailed description is omitted. In this embodiment, when MDS is applied, the initialization takes the geometric structure of the SOM nodes into account.
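  • A sketch using scikit-learn's MDS with a precomputed distance matrix and an explicit initial configuration, as an assumed realization of the initialization described above:

```python
from sklearn.manifold import MDS

def map_nodes(distance_matrix, init_coords, seed=0):
    """Map nodes to 2D, reproducing the node distance table 116.

    distance_matrix: (n, n) distances between node vectors;
    init_coords: (n, 2) initial coordinates derived from the SOM grid
    (e.g. the Pos array of FIG. 6); n_init must be 1 when init is given.
    """
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=1, random_state=seed)
    return mds.fit_transform(distance_matrix, init=init_coords)
```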
  • the node mapping unit 109 generates a node distance table 116 (601).
  • In this embodiment, each feature node vector is taken as the normalized explanatory variables 219, and the distance table is generated using the Euclidean distance.
  • the node mapping unit 109 initializes each variable (602). Specifically, first, lt, lb, rt, and rb are respectively defined as upper left, lower left, upper right, and lower right node indexes in the lattice-like SOM node structure, and all are set to -1. Next, y is set to 0. Next, the array Pos is defined as an array for storing the coordinates of each node. Then, Sw and Sh are defined as node coordinate arrays in the x direction and the y direction, respectively. This array size is determined by the map size 212. All elements of Pos, Sw, and Sh are initialized with 0.
  • the node mapping unit 109 loops the variable p from 1 to the number of data items in the division condition table 114 (603). Thereafter, the processing from step 604 to step 609 is executed for the p-th division condition.
  • When p is greater than 1 (step 604), the node mapping unit 109 sets y to the maximum value in the array Sh plus a predetermined number (e.g., 2) (605).
  • The predetermined number may be changed to an appropriate value.
  • When p equals 1, the node mapping unit 109 proceeds to step 606 without doing anything.
  • The node mapping unit 109 sets the indexes of the four corner nodes (upper left, lower left, upper right, lower right) of the nodes of division condition p to lt, lb, rt, and rb, respectively (606). At this time, lt can be set to rb + 1 and the remaining variables can be set according to the map size 212.
  • the node mapping unit 109 sets values for Sw and Sh (607). In this embodiment, a value obtained by equally dividing the distance between the nodes lt and rt and the distance between lt and lb according to the map size is set.
  • the node mapping unit 109 adds y to each element of Sh (608). If it is desired to move in the x-axis direction, a variable x may be defined and the same processing as y may be applied to Sw.
  • the node mapping unit 109 sets the coordinates of the nodes lt to rb to Pos (609).
  • The coordinates of the node at the position of row i and column j in the SOM node structure may be set to (Sw[i], Sh[j]).
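  • A sketch of steps 606 to 609 for one division condition (an assumed reading of the variable roles; `dist` is the node-vector distance function and `y_offset` plays the role of y):

```python
import numpy as np

def place_grid(node_vectors, rows, cols, dist, y_offset):
    """Assign 2D coordinates to one division condition's SOM grid.

    node_vectors: (rows*cols, d) reference vectors in row-major order.
    """
    lt, rt, lb = 0, cols - 1, (rows - 1) * cols  # corner node indexes
    # step 607: equally divide the lt-rt and lt-lb distances per map size
    Sw = np.linspace(0.0, dist(node_vectors[lt], node_vectors[rt]), cols)
    Sh = np.linspace(0.0, dist(node_vectors[lt], node_vectors[lb]), rows)
    Sh = Sh + y_offset  # step 608: add y to each element of Sh
    # step 609: x from the column index, y from the row index (assumed)
    pos = np.array([(Sw[j], Sh[i]) for i in range(rows) for j in range(cols)])
    return pos
```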
  • the node mapping unit 109 stores the result in the storage area (611), and ends the node mapping process.
  • FIG. 7 is a flowchart of the output process 305 of this embodiment.
  • the output processing unit 110 lists the weighted average scores and generates the weighted average score table 118 (701).
  • the weighted average score table 118 divides the weighted average score for each division condition, sorts the scores of each explanatory variable in descending order of absolute values, and lists the variable names.
  • the output processing unit 110 displays a component map of the node vector (702).
  • The component map visualizes the values of a specific explanatory variable 219 or objective variable 220 of each node under the same condition, using the geometric structure of the SOM nodes and the map size. For example, in the component map of explanatory variable i with map size m × n, the values of explanatory variable i at all nodes having the same condition ID in the node information table are displayed as an m × n image colored according to the values.
  • The component map of the present embodiment images the values of each explanatory variable 219, for a specific division condition, based on the geometric structure of the nodes.
  • a component map using the objective variable 220 can also be displayed. With the component map, the correlation between each explanatory variable and the correlation between the explanatory variable and the objective variable can be visually grasped.
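  • A sketch of the component map rendering with matplotlib (the colormap and layout are assumed visualization choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def component_map(values, rows, cols, title):
    """Display one variable's node values as an m x n image (step 702).

    values: length rows*cols array in the SOM grid's row-major order,
    e.g. explanatory variable i of all nodes with the same condition ID.
    """
    img = np.asarray(values).reshape(rows, cols)
    plt.imshow(img, cmap="viridis")  # color corresponds to the value
    plt.title(title)
    plt.colorbar()
    plt.show()
```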
  • the output processing unit 110 displays a hit map (703).
  • the hit map is obtained by visualizing the hit number 216 (or its logarithm) or the hit rate 217 using the visualization method of step 702.
  • For example, the hit map is imaged by color coding based on the logarithm of the hit rate 217, as illustrated in FIG. 9. Further, as shown in the figure, the numerical value of the hit number may also be displayed. With the hit map, the dense nodes in the learning data distribution can be grasped.
  • the output processing unit 110 displays a score map (704).
  • the score map is obtained by visualizing the score 227 or the objective score for a specific explanatory variable using the visualization method in step 702.
  • The score map is imaged by color coding based on the score 227 of each explanatory variable, as illustrated in FIG. 10. For example, if the color is green where the score is 0 and changes gradually to red in the positive direction and to blue in the negative direction, explanatory variables with a high degree of influence and the sign of their influence are easy to grasp. Further, as shown in the figure, by comparing the pattern with the component map of the corresponding explanatory variable, the values taken by the explanatory variable at nodes with a high degree of influence can be grasped.
  • the output processing unit 110 displays a node map (705).
  • the node map is obtained by visualizing each node as a point in a low-dimensional space based on the coordinates 218 for each node calculated in step 304.
  • the shape and color of a point representing each node may be set according to the value of the explanatory variable in the node information table, the value of the objective variable, the score for each explanatory variable in the score table, the objective score, the division condition, and the like.
  • The node map of this embodiment plots the nodes in a two-dimensional space based on the node coordinates 218 under each division condition, as illustrated in FIG. 11. Further, the geometric structure of the nodes under a specific division condition may be displayed with grid lines.
  • With the node map, the positional relationships of the nodes across a plurality of division conditions can be grasped. For example, when the current rank is used as the division condition, looking at mutually close nodes makes it easy to find nodes that are likely to rank up or are at high risk of ranking down. Differences in features from neighboring nodes can be compared directly using the values in the node information table, or compared using the component map.
  • As described above, the data analysis system of this embodiment includes: a feature node calculation unit 107 that divides an input data set consisting of a plurality of explanatory variables used by the machine learning model during learning, or an input data set consisting of data obtained by processing those explanatory variables, according to a specified division condition, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit 108 that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between each explanatory variable and the objective variable; and an output processing unit 110 that outputs an output result including the score.
  • the feature node representing the feature of the distribution structure can represent the feature of the data set with less data than the learning data.
  • Furthermore, representing the features of the data set around each feature node with the neighborhood data complements the feature nodes. That is, the characteristics of the data set can be represented with a small amount of data, and the amount of calculation can be reduced.
  • the feature node calculation unit 107 calculates the feature node based on the input data set to which the self-organizing map is applied, so that the feature node can be accurately calculated.
  • Further, the feature node calculation unit 107 calculates the feature nodes using an input data set consisting of a plurality of explanatory variables used by the machine learning model during learning and the objective variables calculated by the machine learning model, or an input data set consisting of data obtained by processing those explanatory variables and objective variables, so the objective variables can be compared on a map.
  • Further, the feature node calculation unit 107 divides the input data set according to a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value (for example, the maximum value) or range of an element of the objective variable, or a division condition expressed by a combination of these, so analysis with the target segment narrowed down can be performed. That is, rather than the entire population, data of a group having a specific attribute can be analyzed, with the attribute changed according to the purpose.
  • Further, the score calculation unit 108 calculates a score corresponding to the format of the objective variable for each explanatory variable by applying linear model estimation to the data of the explanatory variables and the data of the objective variables. Because a linear model is simple and easy to handle, it is easy for the user to understand and can improve the reliability of the results. In particular, with a linear model, when a plurality of attributes are combined, the calculation can be performed as a sum of probabilities, which is intuitively easy for the user to understand.
  • Further, the score calculation unit 108 calculates the objective score by summing, among the elements of the objective variable, the elements specified for each division condition, so analysis with the target segment narrowed down can be performed. That is, rather than the entire population, data of a group having a specific attribute can be analyzed, with the attribute changed according to the purpose.
  • Further, the score calculation unit 108 calculates a weighted average score for each explanatory variable from the calculated scores and the calculated objective scores, based on the number of surrounding data items of each feature node under each division condition, so the data characteristics can be expressed correctly in consideration of the distribution.
  • Further, the data analysis system includes a node mapping unit 109 that maps the feature nodes calculated by the feature node calculation unit 107 into a two-dimensional space for each division condition, so the characteristics of each group can be expressed in an easily understandable way.
  • Further, the node mapping unit 109 generates data for displaying, as images based on the geometric structure of the nodes, the feature node values for each explanatory variable, the calculated scores, and the weighted average scores calculated from the scores and the objective scores, so data can be compared between groups of different attributes while the distance relationships between nodes are maintained.
  • Further, the node mapping unit 109 applies the multidimensional scaling method after initializing the feature node vectors, or the feature node vectors including the objective variable components, based on the geometric structure of the feature nodes under each division condition, so attributes with high influence and attributes with low influence can be expressed in an easy-to-understand manner using the score map.
  • Further, when the input data set is time-series data including explanatory variables for each predetermined time, the data obtained by expanding the explanatory variables from a certain point in the past to the present as independent variables is used as input data, and the rules used for the expansion are stored, so the input data set can be analyzed as time-series data.
  • the present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the spirit of the appended claims.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment may be replaced with the configuration of another embodiment.
  • Further, for a part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.
  • Each of the above-described configurations, functions, processing units, processing means, and the like may be realized in hardware by designing a part or all of them with, for example, an integrated circuit, or may be realized in software by a processor interpreting and executing a program that realizes each function.
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.
  • Control lines and information lines indicate what is considered necessary for explanation, and not all control lines and information lines necessary for implementation are necessarily shown. In practice, almost all components may be considered to be connected to each other.

Abstract

A data analysis system provided with a computation device for executing a program and a storage device connected to the computation device. The computation device is provided with: a feature node calculation unit for dividing an input data set comprising a plurality of explanatory variables used during learning by a machine learning model, or an input data set comprising a data set in which the explanatory variables have been processed, under a designated division condition, and calculating a feature node that represents the feature of the distribution structure of each of the divided data sets; a score calculation unit for generating neighbor data of input data that includes the feature node and calculating, on the basis of an explanatory variable of the generated neighbor data and the data of an objective variable obtained by inputting the neighbor data to the machine learning model, a score that represents the relationship between the explanatory variable and the objective variable; and an output processing unit for outputting an output result that includes the score.

Description

Data analysis system and data analysis method

Incorporation by reference

This application claims the priority of Japanese Patent Application No. 2018-83408, filed on April 24, 2018, the contents of which are incorporated into the present application by reference.
This disclosure relates to a data analysis system.

Machine learning technologies such as neural networks are attracting attention, and attempts have been made to solve various problems using machine learning models obtained by machine learning. For example, Japanese Patent Application Laid-Open No. 2003-323601 describes a prediction device comprising: a similar case extraction unit 1 that, given a known case set and a prediction case, extracts from the known case set a similar case set, i.e., a set of cases similar to the prediction case; a certainty factor calculation unit 2 that calculates the certainty factor of a prediction attribute value from the similar case set; and a reliability measure calculation unit 3 that calculates a reliability measure of that certainty factor from the similar case set and the certainty factor. The device outputs the certainty factor of a prediction attribute value together with its reliability measure.

However, the method described in Japanese Patent Application Laid-Open No. 2003-323601 merely supports the user's subsequent judgment on a prediction result by attaching a reliability measure, which indicates the reliability of the certainty factor, to the certainty factor of a prediction based on similar cases; the user cannot know the degree of contribution of each explanatory variable to the prediction result. That is, the user cannot know what factors in the input data led to the prediction result. In other words, the user has been using the machine learning model while the relationship between the explanatory variables and the objective variable (the prediction result) of the neural network remains unknown. For this reason, it is difficult for the user to know what kind of judgment to make based on the prediction result.

The present invention has been made in view of this situation, and provides a technique for visualizing the degree of influence of each explanatory variable on the objective variable, making it possible to grasp what judgment should be made based on the prediction result.

A representative example of the invention disclosed in the present application is as follows: a data analysis system comprising an arithmetic device that executes a program and a storage device connected to the arithmetic device, wherein the arithmetic device provides a feature node calculation unit that divides an input data set consisting of a plurality of explanatory variables used by a machine learning model during learning, or an input data set consisting of data obtained by processing those explanatory variables, according to a specified division condition, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between each explanatory variable and the objective variable; and an output processing unit that outputs an output result including the score.

According to one aspect of the present invention, the degree of influence of explanatory variables on the objective variable can be visualized. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment. FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment. FIG. 3 is a flowchart of the overall processing of this embodiment. FIG. 4 is a flowchart of the feature node calculation process of this embodiment. FIG. 5 is a flowchart of the score calculation process of this embodiment. FIG. 6 is a flowchart of the node mapping process of this embodiment. FIG. 7 is a flowchart of the output process of this embodiment. FIG. 8 is a diagram showing an example of the component map of this embodiment. FIG. 9 is a diagram showing an example of the hit map of this embodiment. FIG. 10 is a diagram showing an example of the score map of this embodiment. FIG. 11 is a diagram showing an example of the node map of this embodiment.
<Example 1>

Embodiments of the present invention will be described below with reference to the drawings.
In this embodiment, the machine learning model has been trained in advance; the system refers to the learning data used in that training and obtains output results using the trained machine learning model. The machine learning model returns a k-dimensional output vector in response to a d-dimensional input vector, and the description assumes that the output signal of the machine learning model in this embodiment corresponds to the classification probabilities of k classification classes.

FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment.

The data analysis system of this embodiment is a computer that analyzes the relationship between input data and output data in machine learning, and includes an input device 101, an output device 102, a display device 103, a processing device 104, and a storage device 111.

The input device 101 is a keyboard, a mouse, or the like, and is an interface that receives input from the user. The output device 102 is a printer or the like, and is an interface that outputs the execution results of the program in a format the user can view. The display device 103 is a display such as a liquid crystal display, and is an interface that outputs the execution results of the program in a format the user can view. A terminal connected to the data analysis system via a network may provide the input device 101, the output device 102, and the display device 103.

The processing device 104 includes a processor (arithmetic device) that executes a program and a memory that stores the program and data. Specifically, the input processing unit 106, the feature node calculation unit 107, the score calculation unit 108, the node mapping unit 109, and the output processing unit 110 are realized by the processor executing the program. A part of the processing performed by the processor may instead be executed by another arithmetic device (for example, an FPGA).

The memory includes a ROM, which is a nonvolatile storage element, and a RAM, which is a volatile storage element. The ROM stores an immutable program (for example, the BIOS). The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the program executed by the processor and the data used when the program is executed.

The storage device 111 is a large-capacity, nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD). The storage device 111 stores the data used by the processing device 104 when executing the program and the program executed by the processing device 104. Specifically, the storage device 111 stores the data necessary for the series of processes and the output results, including an input data table 112, a normalization information table 113, a division condition table 114, a node information table 115, a node distance table 116, a score table 117, and a weighted average score table 118. The program is read from the storage device 111, loaded into the memory, and executed by the processor.

The data analysis system may have a communication interface that controls communication with other devices according to a predetermined protocol.

The program executed by the processing device 104 is provided to the data analysis system via a removable medium (CD-ROM, flash memory, etc.) or a network, and is stored in the nonvolatile storage device 111, a non-transitory storage medium. For this reason, the data analysis system may have an interface for reading data from a removable medium.

The data analysis system is a computer system configured on one physical computer or on a plurality of logically or physically configured computers, and may operate on a virtual machine built on a plurality of physical computer resources.

FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment.
The input data table 112 stores the learning data of the machine learning model processed into the format used in the series of processes by the data analysis system of this embodiment, and includes explanatory variables 1 to d (201) and objective variables 1 to k (202).

The explanatory variables 1 to d (201) represent the d-dimensional vectors that are the input data of the machine learning model. In machine learning, however, data is often normalized for each variable; in this embodiment, the normalized data is converted back to the original numerical values using the normalization information table 113 before being stored. When the learning data of the machine learning model is a time series, a variable x can be flattened into per-time-point variables named x_t0, x_t1, x_t2, and so on. In this case, the number of dimensions of the explanatory variables 201 does not match the number of input dimensions of the machine learning model, and the data format is converted each time data in the input format of this embodiment is input to the machine learning model. The objective variables 1 to k (202) are the k-dimensional vectors that are the output results of the machine learning model.

The normalization information table 113 stores information on the normalization processing performed when the machine learning model was trained, and contains the data of a variable ID 203, a variable name 204, a data type 205, an average 206, a standard deviation 207, and model data format correspondence information 208.

The variable ID 203 is an index that identifies an element of the explanatory variables 201. The variable name 204 is the name of the explanatory variable. The data type 205 is the data type of the explanatory variable (for example, logical type, integer type, floating point type, etc.).

The average 206 and the standard deviation 207 store the average and the standard deviation used in the normalization processing when the machine learning model was trained. For variables not subjected to normalization, such as logical variables, the average may be set to 0 and the standard deviation to 1. The model data format correspondence information 208 stores information for mutual conversion when the input format of the machine learning model differs from the input format handled by the data analysis system of this embodiment. For example, for data including a time series, the variable x is expanded into x_t0, x_t1, and so on, so describing the correspondence between the pre-expansion and post-expansion indexes enables mutual conversion.

The division condition table 114 stores the conditions by which the feature node calculation unit 107 divides the input data table 112, and contains the data of a condition ID 209, a division condition 210, a number of data 211, a map size 212, and aggregation flags 1 to k (213).

The condition ID 209 is identification information that identifies a condition recorded in the division condition table 114. The division condition 210 is a condition for dividing the input data to obtain one data set; for example, a character string such as an SQL select statement may be used. The division condition 210 may describe a specific value or range for an explanatory variable, or a combination of conditions on the values of the objective variables. The number of data 211 is the number of data items in the input data selected according to the division condition.

The map size 212 stores the map size used by the feature node calculation unit 107 in the node vector calculation step 403 of FIG. 4. When the map size is set automatically, the value of the map size 212 may initially be NULL or the like, with the result of the automatic setting stored afterwards. The aggregation flags 1 to k (213) are an array of k flags for the objective variables, used in the objective score aggregation process 504 of FIG. 5. Only the scores of the objective variables whose flag is set to 1 are aggregated into the objective score. For example, in an analysis aiming at rank-up from the current rank in a member management system, each current rank is set as a division condition, and the flags of the objective variables corresponding to predicted ranks higher than the current rank are set to 1.
The node information table 115 stores the feature node calculation results produced by the feature node calculation unit 107, and contains a condition ID 214, a node ID 215, a hit count 216, a hit rate 217, coordinates 218, explanatory variables 1 to d (219), and objective variables 1 to k (220).
The condition ID 214 is the identification information (condition ID 209) of a condition recorded in the division condition table 114. The node ID 215 identifies a node belonging to the condition specified by the condition ID 214. The hit count 216 is, for the node identified by the node ID 215, the number of records in the data set selected by the division condition for which that node is closer than any other node. The hit rate 217 is the hit count 216 divided by the data count 211. The coordinates 218 are the result of the node mapping process 304 shown in FIG. 3.
The explanatory variables 1 to d (219) and the objective variables 1 to k (220) are data in the same format as the input data table 112, and constitute the node vector that represents the features of the distribution structure of the input data set. This vector need not coincide with any record in the input data table, and need not follow the type specified in the data type 205; for example, even if a Boolean or integer type is specified, the values can be stored as floating-point data.
The node distance table 116 stores the distance between each pair of nodes, computed over the explanatory variables 219 of the node information table 115, or over the node vector obtained by appending the objective variables 220 to the explanatory variables 219. It contains a node-from 221, a node-to 222, and a distance 223.
The node-from 221 and node-to 222 are identification information specifying nodes recorded in the node information table 115. Their values may be pairs of the condition ID 214 and the node ID 215, or indexes into the node information table 115. The distance 223 is the distance between the node vectors of the node-from 221 and the node-to 222.
The node distance table 116 may also be represented as a two-dimensional array, in which case the indexes of the node information table 115 are used for the rows and columns.
The score table 117 stores the calculation results of the score calculation unit 108, and contains an objective variable ID 224, a condition ID 225, a node ID 226, and scores 227 of the explanatory variables 1 to d.
The objective variable ID 224 stores the element number of the k-dimensional vector in the output of the machine learning model. The condition ID 225 and node ID 226 are identification information specifying a node recorded in the node information table 115, and use the same values as the condition ID 214 and node ID 215 of that table. The scores 227 of the explanatory variables 1 to d are the calculation results of the score calculation unit 108, stored per objective variable ID 224, condition ID 225, node ID 226, and explanatory variable.
The score table 117 also stores the objective score and the weighted average score described with FIG. 5. For the objective score, the objective variable ID 224 is set to -1 or the like; for the weighted average score, both the objective variable ID 224 and the node ID 226 are set to -1 or the like, indicating that the score is not tied to a specific objective variable or node.
The weighted average score table 118 stores the scores per division condition and per explanatory variable in this form, with the objective variable ID 224 and node ID 226 left unspecified as -1. Specifically, the weighted average score table 118 groups the weighted average scores calculated in step 505 of the score calculation process 303 (FIG. 5), described later, by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names. The weighted average score table 118 is generated in step 701 of the output process 305 (FIG. 7). With this table, the user can easily grasp which explanatory variables have a high degree of influence for each target segment represented by a division condition, and can compare the rank and score of each explanatory variable across division conditions. For example, attribute A may have a large influence under conditions 1 and 2, while attribute J has a large influence under conditions 3 and 4. Also, if the score of attribute I has opposite signs under different conditions, applying the same measure may produce the opposite effect. In this way, the table can be used to plan measures for the target segment indicated by each condition.
FIG. 3 is a flowchart of the overall processing of this embodiment.
First, the input processing unit 106 executes the input process (301). For example, the input processing unit 106 refers to the normalization information table 113, converts the training data of the machine learning model from the model's input format into the input format of this embodiment, restores the normalized values to their original values, and stores the result in the input data table 112.
Next, the feature node calculation unit 107 executes the feature node calculation process (302). For example, the feature node calculation unit 107 divides the input data table 112 according to the division condition table 114, calculates feature nodes from each divided data set, and stores the result in the node information table. The details of the feature node calculation process are described with FIG. 4.
Next, the score calculation unit 108 executes the score calculation process (303). For example, the score calculation unit 108 calculates scores representing the degree of influence of the explanatory variables and stores the result in the score table. The details of the score calculation process are described with FIG. 5.
Next, the node mapping unit 109 executes the node mapping process (304). For example, the node mapping unit 109 maps the feature nodes obtained in step 302 into a low-dimensional space. The details of the node mapping process are described with FIG. 6.
Next, the output processing unit 110 executes the output process (305), and the processing ends. The details of the output process are described with FIG. 7.
FIG. 4 is a flowchart of the feature node calculation process 302 of this embodiment.
First, the feature node calculation unit 107 loops a variable p from 1 to the number of entries in the division condition table 114 (401). The processing of steps 402 to 405 is then executed for the p-th division condition.
Next, the feature node calculation unit 107 performs the data division process (402). For example, it selects from the input data table 112 the records that satisfy the division condition 210 of the p-th division condition. The selected data set is normalized using the normalization information table.
Next, the feature node calculation unit 107 calculates node vectors (403). For example, a clustering method such as k-means is used to compute, from the selected data set, node vectors that represent its features with a smaller number of nodes while respecting its distribution structure. In this embodiment, a self-organizing map (hereinafter abbreviated as SOM) is applied. An SOM is a kind of neural network represented by nodes arranged on a grid and edges connecting adjacent nodes. Each node is assigned a reference vector in the same format as the input data. During training, the reference vector of the node whose distance to a training sample is smallest (hereinafter abbreviated as the BMU, for Best Matching Unit) is updated toward that sample, and the reference vectors of the nodes connected to the BMU are updated toward it as well. Since the SOM is a well-known technique, a detailed description is omitted. By repeating this process, the complex distribution structure of the training data can be mapped onto the geometric structure of the nodes.
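As an illustration only, the following is a minimal numpy sketch of this SOM update rule; the learning-rate and neighborhood schedules, the random initialization, and the function names are simplifying assumptions, not the patent's prescription:

```python
import numpy as np

def train_som(data, m, n, iters=1000, lr0=0.5, sigma0=None, seed=0):
    """Train an m x n SOM on data of shape (num_samples, d)."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    sigma0 = sigma0 or max(m, n) / 2.0
    weights = rng.normal(size=(m, n, d))            # reference vectors
    grid = np.stack(np.meshgrid(np.arange(m), np.arange(n), indexing="ij"), axis=-1)
    for t in range(iters):
        lr = lr0 * (1.0 - t / iters)                # decaying learning rate
        sigma = sigma0 * (1.0 - t / iters) + 1e-3   # shrinking neighborhood
        x = data[rng.integers(len(data))]
        # BMU: node whose reference vector is closest to the sample
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), (m, n))
        # Pull the BMU and its grid neighbors toward the sample
        grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))[:, :, None]
        weights += lr * h * (x - weights)
    return weights
```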
The reference vector of each node calculated as the result of the SOM is stored in the node information table 115 in the form of the explanatory variables 219 and the objective variables 220.
The format of the training data when executing the SOM can be set to either the explanatory variables alone, or the explanatory variables together with the objective variables; which format to use is preferably set in advance. The objective variables 220 in the output result then follow the input format of this training data.
Next, the feature node calculation unit 107 counts the number of hits per node (404). Here, for each node calculated in step 403, the number of records in the selected data set for which that node is the BMU is calculated as the value of the hit count 216. The hit rate 217 is calculated by dividing this by the number of selected records.
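Continuing the sketch above, the hit counts and hit rates could be computed as follows (again one possible implementation, not the patent's):

```python
import numpy as np

def hit_counts(data, weights):
    """Count, per node, how many records have that node as their BMU."""
    m, n, _ = weights.shape
    counts = np.zeros((m, n), dtype=int)
    flat = weights.reshape(m * n, -1)
    for x in data:
        bmu = np.argmin(np.linalg.norm(flat - x, axis=1))
        counts[np.unravel_index(bmu, (m, n))] += 1
    return counts, counts / len(data)   # hit count 216, hit rate 217
```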
Next, the feature node calculation unit 107 stores the calculated results in the data storage area (405). Since the node vectors calculated in step 403 are normalized, they are restored to their original scale using the normalization information table 113, and the result is stored.
When the loop from step 401 to step 405 ends, the feature node calculation process 302 ends.
FIG. 5 is a flowchart of the score calculation process 303 of this embodiment.
First, the score calculation unit 108 loops a variable i from 1 to the number of entries in the node information table 115 (501). The processing of steps 502 to 504 is then executed for the i-th node.
Next, the score calculation unit 108 generates a neighborhood data set of node i and the prediction results of the machine learning model for it (502). Neighborhood data are vector data located around the d-dimensional vector represented by the explanatory variables of the node specified by the variable i. In this embodiment, the neighborhood data are generated as random numbers drawn from a normal distribution whose mean is the value of the explanatory variables of node i and whose standard deviation is half the standard deviation in the normalization information, although other generation methods may be used. The number of records in the neighborhood data set is preferably specified in advance. Prediction with the machine learning model can be executed by applying normalization using the normalization information table and conversion using the model data format correspondence information 208.
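A minimal sketch of this neighborhood generation under the stated normal-distribution assumption (variable names, the sample count, and the `model`/`normalize` helpers in the usage comment are illustrative):

```python
import numpy as np

def generate_neighborhood(node_vec, norm_std, n_samples=500, seed=0):
    """Draw neighborhood samples around a node's explanatory-variable vector.

    node_vec: (d,) explanatory variables of node i (used as the mean).
    norm_std: (d,) standard deviations from the normalization information;
              half of these are used as the sampling standard deviation.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=node_vec, scale=norm_std / 2.0,
                      size=(n_samples, len(node_vec)))

# Usage (model and normalize() are assumed to be given elsewhere):
# X_nb = generate_neighborhood(node_vec, norm_std)
# Y_nb = model.predict(normalize(X_nb))
```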
Next, the score calculation unit 108 performs a local model estimation process on the generated neighborhood data set and the corresponding prediction results of the machine learning model (503). In step 503, a score representing the relationship between the explanatory variables and the objective variables is obtained for the neighborhood data. In this embodiment, linear model estimation is applied to the neighborhood data set and the machine learning model's predictions, and the estimated parameters are used as the scores. That is, the output Y of the machine learning model for the d-dimensional explanatory variables X = (x1, x2, ..., xd) is approximated by the linear model below, and the estimated parameter Si is used as the score for the input xi. Here, Y, Si, and C are k-dimensional vectors. Since linear model estimation is a well-known technique, a detailed description is omitted.
$$ Y = \sum_{i=1}^{d} S_i x_i + C $$
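As a sketch, this estimation can be carried out with ordinary least squares; here numpy's `lstsq` solves for the d slope vectors Si and the intercept C jointly (a simplification; the patent does not name a specific estimator):

```python
import numpy as np

def local_linear_scores(X_nb, Y_nb):
    """Fit Y ~ sum_i Si * xi + C by least squares.

    X_nb: (n, d) neighborhood samples; Y_nb: (n, k) model predictions.
    Returns S of shape (d, k) (the scores) and C of shape (k,).
    """
    A = np.hstack([X_nb, np.ones((len(X_nb), 1))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, Y_nb, rcond=None)  # shape (d + 1, k)
    return coef[:-1], coef[-1]                       # S (d, k), C (k,)
```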
Next, the score calculation unit 108 aggregates the scores obtained in step 503 according to the aggregation flags 213 to calculate the objective score (504). Specifically, the scores of the elements whose flag is 1 are summed per explanatory variable.
Then, when the loop from step 501 to step 504 ends, the score calculation unit 108 applies the hit rates 217 to the objective scores to calculate the weighted average scores (505). A weighted average score is calculated per explanatory variable over all nodes with the same condition ID.
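A minimal sketch of steps 504 and 505 together (the array layout is an assumption; since the hit rates of one condition's nodes sum to 1, the hit-rate-weighted sum is already the weighted average):

```python
import numpy as np

def weighted_average_scores(S_per_node, hit_rates, agg_flags):
    """S_per_node: (num_nodes, d, k) scores per node of one condition;
    hit_rates: (num_nodes,) hit rates 217; agg_flags: (k,) 0/1 flags.
    Returns (d,) weighted average scores."""
    # Step 504: objective score per node = sum over flagged objective variables
    objective = (S_per_node * agg_flags).sum(axis=2)      # (num_nodes, d)
    # Step 505: hit-rate-weighted average over the nodes of the condition
    return (objective * hit_rates[:, None]).sum(axis=0)   # (d,)
```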
The score calculation unit 108 then stores the calculated results in the data storage area (506), and the score calculation process ends.
FIG. 6 is a flowchart of the node mapping process 304 of this embodiment. In this embodiment, multidimensional scaling (hereinafter abbreviated as MDS) is used to map the per-division-condition sets of grid-arranged planar SOM nodes to two-dimensional coordinates, but the geometric structure of the nodes and the target space may have other dimensionalities.
MDS is a technique for mapping nodes in a multidimensional vector space into a low-dimensional space such as two or three dimensions, placing them so that the distances between nodes are reproduced as faithfully as possible. Since MDS is a well-known technique, a detailed description is omitted. In this embodiment, when MDS is applied, the initialization takes the geometric structure of the SOM nodes into account.
First, the node mapping unit 109 generates the node distance table 116 (601). In this embodiment, each feature node vector is taken to be the normalized explanatory variables 219, and the distance table is generated using the Euclidean distance.
Next, the node mapping unit 109 initializes the variables (602). Specifically, lt, lb, rt, and rb are first defined as the node indexes of the upper-left, lower-left, upper-right, and lower-right corners of the grid-shaped SOM node structure, and all are set to -1. Next, y is set to 0. Next, an array Pos is defined to store the coordinates of each node. Then Sw and Sh are defined as arrays of node coordinates in the x and y directions, respectively; their sizes are determined by the map size 212. All elements of Pos, Sw, and Sh are initialized to 0.
Next, the node mapping unit 109 loops a variable p from 1 to the number of entries in the division condition table 114 (603). The processing of steps 604 to 609 is then executed for the p-th division condition.
Next, if rb is 0 or more (Yes in 604), the node mapping unit 109 sets y to the maximum value in the array Sh plus a predetermined number (for example, 2) (605). The predetermined number may be changed to any appropriate value.
On the other hand, if rb is negative (No in 604), the node mapping unit 109 does nothing and proceeds to step 606.
Next, the four corner node indexes for the nodes of division condition p are set to lt, lb, rt, and rb, respectively (606). At this point, lt can be set to rb + 1 and the remaining variables set according to the map size 212.
Next, the node mapping unit 109 sets the values of Sw and Sh (607). In this embodiment, the distance between nodes lt and rt and the distance between nodes lt and lb are divided equally according to the map size, and the resulting values are set.
Next, the node mapping unit 109 adds y to each element of Sh (608). To shift in the x-axis direction instead, a variable x can be defined and the same processing applied to Sw.
Next, the node mapping unit 109 sets the coordinates of nodes lt through rb in Pos (609). In this processing, for example, the coordinates of the node at row i, column j of the SOM node structure may be set to (Sw[i], Sh[j]).
Then, when the loop over steps 604 to 609 that began at step 603 ends, MDS is applied using Pos as the initial coordinates of the nodes (610).
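As a sketch of steps 602 to 610, the grid initialization and the MDS call might look as follows. It assumes scikit-learn's MDS with a precomputed dissimilarity matrix and an initial configuration passed to `fit_transform`; the unit-spaced grid and vertical-offset scheme are a simplified reading of the flowchart, not the patent's exact procedure:

```python
import numpy as np
from sklearn.manifold import MDS

def initial_positions(map_sizes):
    """Stack each condition's m x n SOM grid below the previous one,
    leaving a vertical gap of 2, as initial MDS coordinates."""
    pos, y = [], 0.0
    for (m, n) in map_sizes:            # one (rows, cols) pair per condition
        sw = np.arange(n, dtype=float)              # x coordinates
        sh = np.arange(m, dtype=float) + y          # y coordinates, shifted
        for i in range(m):
            for j in range(n):
                pos.append((sw[j], sh[i]))
        y = sh.max() + 2.0                          # gap before next condition
    return np.array(pos)

def map_nodes(dist_matrix, map_sizes, seed=0):
    """dist_matrix: (num_nodes, num_nodes) node distance table 116."""
    pos0 = initial_positions(map_sizes)
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=1, random_state=seed)
    return mds.fit_transform(dist_matrix, init=pos0)
```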
Next, the node mapping unit 109 stores the result in the storage area (611), and the node mapping process ends.
FIG. 7 is a flowchart of the output process 305 of this embodiment.
First, the output processing unit 110 lists the weighted average scores and generates the weighted average score table 118 (701). As described above, the weighted average score table 118 groups the weighted average scores by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names.
Next, the output processing unit 110 displays a component map of the node vectors (702). A component map visualizes the value of a particular explanatory variable 219 or objective variable 220 of each node under the same condition, using the geometric structure of the SOM nodes and the map size. For example, when the map size is m x n, the component map of explanatory variable i displays the values of explanatory variable i over all nodes with the same condition ID in the node information table as an m x n image, with colors corresponding to the values.
As illustrated in FIG. 8, the component map of this embodiment renders, for a specific division condition, the value of the explanatory variable 219 as an image for each explanatory variable, based on the geometric structure of the nodes. Also, when step 403 processed vectors that also included the objective variables 202, component maps using the objective variables 220 can be displayed as well. Component maps make it possible to visually grasp correlations among the explanatory variables, and between the explanatory variables and the objective variables.
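A component map of this kind can be sketched with matplotlib (illustrative only; the patent does not prescribe a plotting library or colormap):

```python
import matplotlib.pyplot as plt

def show_component_map(node_vectors, m, n, var_idx, title=""):
    """node_vectors: (m*n, d) reference vectors of one condition's nodes,
    in row-major grid order. Renders variable var_idx as an m x n image."""
    img = node_vectors[:, var_idx].reshape(m, n)
    plt.imshow(img, cmap="viridis")
    plt.colorbar(label=f"variable {var_idx}")
    plt.title(title)
    plt.show()
```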
Next, the output processing unit 110 displays a hit map (703). A hit map visualizes the hit count 216 (or its logarithm) or the hit rate 217 using the visualization method of step 702.
As illustrated in FIG. 9, the hit map of this embodiment renders the hit counts as an image, color-coded based on the logarithm of the hit rate 217. The numerical hit counts may also be displayed, as in the figure. A hit map makes it possible to identify, for example, nodes where the training data distribution is dense.
Next, the output processing unit 110 displays a score map (704). A score map visualizes the scores 227 for a specific explanatory variable, or the objective scores, using the visualization method of step 702.
As illustrated in FIG. 10, the score map of this embodiment renders the scores as an image, color-coded per explanatory variable based on the score 227. For example, by setting a score of 0 to green and grading toward red in the positive direction and blue in the negative direction, it is easy to grasp at which node positions which explanatory variables have a strong influence. Also, as in the figure, comparing the pattern with the component map of the corresponding explanatory variable reveals how the values of that explanatory variable behave at the nodes with high influence.
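The blue-green-red color coding described above could be sketched as follows; the colormap endpoints and the small guard values are assumptions, while `LinearSegmentedColormap.from_list` and `TwoSlopeNorm` are standard matplotlib API:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, TwoSlopeNorm

def show_score_map(scores, m, n):
    """scores: (m*n,) numpy array of one explanatory variable's scores,
    in row-major grid order. Score 0 maps to green."""
    cmap = LinearSegmentedColormap.from_list("score", ["blue", "green", "red"])
    # Guards keep vmin < 0 < vmax even if all scores share one sign.
    norm = TwoSlopeNorm(vmin=min(scores.min(), -1e-9), vcenter=0.0,
                        vmax=max(scores.max(), 1e-9))
    plt.imshow(scores.reshape(m, n), cmap=cmap, norm=norm)
    plt.colorbar(label="score")
    plt.show()
```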
Next, the output processing unit 110 displays a node map (705). A node map visualizes each node as a point in the low-dimensional space, using the per-node coordinates 218 calculated in step 304. The shape, color, and so on of the point representing each node may be set according to the values of the explanatory variables and objective variables in the node information table, the per-explanatory-variable scores in the score table, the objective score, the division condition, and the like.
As illustrated in FIG. 11, the node map of this embodiment plots the nodes in a two-dimensional space based on the node coordinates 218 under each division condition. The geometric structure of the nodes under a specific division condition may also be displayed with grid lines. A node map makes it possible to grasp the positional relationships among the nodes across multiple division conditions. For example, when the current rank is used as the division condition, looking at nearby nodes makes it easy to find nodes that appear likely to rise in rank, or that carry a high risk of falling. Differences in features from such neighboring nodes can be examined by directly comparing the values in the node information table, or by using the component maps.
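A node map can be sketched as a scatter plot over the MDS coordinates (illustrative; coloring by division condition is just one of the options the text mentions):

```python
import matplotlib.pyplot as plt

def show_node_map(coords, condition_ids):
    """coords: (num_nodes, 2) coordinates 218; condition_ids: per-node labels."""
    plt.scatter(coords[:, 0], coords[:, 1], c=condition_ids, cmap="tab10")
    plt.colorbar(label="division condition")
    plt.title("Node map")
    plt.show()
```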
The process then ends.
The visualization methods described above can be executed in any order according to the user's instructions, and may also be combined and displayed simultaneously.
As described above, the data analysis system of this embodiment comprises: a feature node calculation unit 107 that divides, according to a specified division condition, an input data set consisting of the plurality of explanatory variables used by the machine learning model during training, or an input data set consisting of a data set in which those explanatory variables have been processed, and that calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit 108 that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data into the machine learning model, scores representing the relationship between the explanatory variables and the objective variables; and an output processing unit 110 that outputs results including the scores. With this configuration, for a trained machine learning model, the degree of influence of each explanatory variable on the objective variables can be calculated and visualized for each target segment indicated by a division condition. The feature nodes, which represent the features of the distribution structure, can express the features of a data set with less data than the training data. Even when the training data are few and not exhaustive, the neighborhood data can express the features of the data set and complement the feature nodes. In other words, the features of a data set can be represented with a small amount of data, reducing the amount of computation.
Further, since the feature node calculation unit 107 calculates the feature nodes based on the input data set to which a self-organizing map has been applied, the feature nodes can be calculated accurately.
Further, the feature node calculation unit 107 calculates the feature nodes using an input data set consisting of the plurality of explanatory variables used by the machine learning model during training and the objective variables calculated by the machine learning model, or an input data set consisting of a data set in which those explanatory variables and objective variables have been processed, so the objective variables can be compared on the maps.
Further, the feature node calculation unit 107 divides the input data set by a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value (for example, the maximum value) or range of an element of the objective variables, or by a division condition expressed by a combination of these, so the analysis can be narrowed to a target segment. That is, by varying the attributes according to the purpose rather than analyzing the whole population, data of a group having specific attributes can be analyzed.
Further, the score calculation unit 108 calculates a score corresponding to the format of the objective variables for each explanatory variable by applying linear model estimation based on the explanatory variable data and the objective variable data. Since a linear model is simple and easy to handle, it is easy for the user to understand, and the reliability of the results can be improved. In particular, with a linear model, when multiple attributes are integrated, the result can be computed as a sum of contributions, which is intuitively easy for the user to grasp.
Further, the score calculation unit 108 calculates the objective score by aggregating, among the elements of the objective variables, the parts specified for each division condition, so the analysis can be narrowed to a target segment. That is, by varying the attributes according to the purpose rather than analyzing the whole population, data of a group having specific attributes can be analyzed.
Further, for the calculated scores and the calculated objective scores, the score calculation unit 108 calculates a weighted average score for each explanatory variable based on the number of neighboring data items of each feature node under each division condition, so the characteristics of the data can be represented correctly, taking the density distribution into account.
Further, the system includes a node mapping unit 109 that maps the feature nodes calculated by the feature node calculation unit 107 into a two-dimensional space under each division condition, so the characteristics of each group can be presented in an easily understandable way.
Further, the node mapping unit 109 generates data for rendering as images, based on the geometric structure of the nodes, the values of the feature nodes for each explanatory variable, the calculated scores, and the weighted average scores calculated for the scores and objective scores, so data can be compared across groups with different attributes while preserving the distance relationships between nodes.
Further, the node mapping unit 109 initializes the vectors of the feature nodes, or the feature node vectors including the objective variable components, based on the geometric structure of the feature nodes of each division condition, and then applies multidimensional scaling to perform the mapping, so the score maps can present attributes with high and low influence in an easily understandable way.
Further, when the input data set is time-series data containing explanatory variables at fixed time intervals, data obtained by expanding those explanatory variables into independent variables from some past point in time to the present are used as the input data, and the rules used for the expansion are stored, so time-series input data sets can also be analyzed.
The present invention is not limited to the embodiments described above, and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above are described in detail to explain the present invention clearly, and the present invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, the configuration of another embodiment may be added to the configuration of a given embodiment, and parts of the configuration of each embodiment may have other configurations added, deleted, or substituted.
Each of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing some or all of them as integrated circuits, or realized in software by a processor interpreting and executing programs that implement the respective functions.
Information such as the programs, tables, and files that implement each function can be stored in a storage device such as a memory, hard disk, or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
The control lines and information lines shown are those considered necessary for explanation, and not necessarily all the control lines and information lines required for implementation. In practice, almost all of the components may be considered mutually connected.

Claims (12)

  1.  A data analysis system comprising:
     an arithmetic device that executes a program; and
     a storage device connected to the arithmetic device,
     wherein the arithmetic device provides:
     a feature node calculation unit that divides, according to a specified division condition, an input data set consisting of a plurality of explanatory variables used by a machine learning model during training, or an input data set consisting of a data set in which the explanatory variables have been processed, and calculates feature nodes representing features of a distribution structure of each divided data set;
     a score calculation unit that generates neighborhood data of input data including the feature nodes, and calculates, based on explanatory variables of the generated neighborhood data and objective variable data obtained by inputting the neighborhood data into the machine learning model, a score representing a relationship between the explanatory variables and the objective variables; and
     an output processing unit that outputs an output result including the score.
  2.  The data analysis system according to claim 1,
     wherein the feature node calculation unit calculates the feature nodes based on the input data set to which a self-organizing map has been applied.
  3.  The data analysis system according to claim 1,
     wherein the feature node calculation unit calculates the feature nodes using an input data set consisting of the plurality of explanatory variables used by the machine learning model during training and objective variables calculated by the machine learning model, or an input data set consisting of a data set in which the explanatory variables and the objective variables have been processed.
  4.  The data analysis system according to claim 1,
     wherein the feature node calculation unit divides the input data set by a division condition including at least one of a specific value or range of a specific explanatory variable and a specific value or range of an element of the objective variables, or by a division condition expressed by a combination thereof.
  5.  The data analysis system according to claim 1,
     wherein the score calculation unit calculates a score corresponding to the format of the objective variables for each explanatory variable by applying linear model estimation based on the data of the explanatory variables and the data of the objective variables.
  6.  The data analysis system according to claim 1,
     wherein the score calculation unit calculates an objective score by aggregating, among the elements of the objective variables, the parts specified for each division condition.
  7.  The data analysis system according to claim 6,
     wherein the score calculation unit calculates, for the calculated score and the calculated objective score, a weighted average score for each explanatory variable based on the number of neighboring data items of each feature node under each division condition.
  8.  The data analysis system according to claim 1,
     wherein the arithmetic device further provides a node mapping unit that maps the feature nodes calculated by the feature node calculation unit into a two-dimensional space under each division condition.
  9.  The data analysis system according to claim 7,
     wherein the arithmetic device further provides a node mapping unit that maps the feature nodes calculated by the feature node calculation unit into a two-dimensional space under each division condition, and
     wherein the node mapping unit generates data for rendering as images, based on a geometric structure of the nodes, the values of the feature nodes for each explanatory variable, the calculated scores, and the weighted average scores calculated for the scores and the objective scores.
  10.  The data analysis system according to claim 8,
     wherein the node mapping unit initializes the vectors of the feature nodes, or feature node vectors including objective variable components, based on the geometric structure of the feature nodes of the division condition, and then applies multidimensional scaling to perform the mapping.
  11.  The data analysis system according to claim 1,
     wherein, when the input data set is time-series data containing explanatory variables at fixed time intervals, data obtained by expanding the explanatory variables into independent variables from a past point in time to the present are used as the input data, and
     wherein the arithmetic device stores the rules used for the expansion.
  12.  A data analysis method executed by a computer,
     wherein the computer has an arithmetic device that executes a program and a storage device connected to the arithmetic device, and
     wherein the method comprises:
     dividing, by the arithmetic device, according to a specified division condition, an input data set consisting of a plurality of explanatory variables used by a machine learning model during training, or an input data set consisting of a data set in which the explanatory variables have been processed;
     calculating, by the arithmetic device, feature nodes representing features of a distribution structure of each divided data set;
     generating, by the arithmetic device, neighborhood data of input data including the feature nodes;
     calculating, based on explanatory variables of the generated neighborhood data and objective variable data obtained by inputting the neighborhood data into the machine learning model, a score representing a relationship between the explanatory variables and the objective variables; and
     outputting, by the arithmetic device, an output result including the score.
PCT/JP2019/005167 2018-04-24 2019-02-13 Data analysis system and data analysis mehtod WO2019207910A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018083408A JP6863926B2 (en) 2018-04-24 2018-04-24 Data analysis system and data analysis method
JP2018-083408 2018-04-24

Publications (1)

Publication Number Publication Date
WO2019207910A1

Family

ID=68295150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/005167 WO2019207910A1 (en) 2018-04-24 2019-02-13 Data analysis system and data analysis mehtod

Country Status (2)

Country Link
JP (1) JP6863926B2 (en)
WO (1) WO2019207910A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7353952B2 (en) 2019-12-09 2023-10-02 株式会社日立製作所 Analysis system and method
CN113344214B (en) * 2021-05-31 2022-06-14 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
JP7314328B1 (en) 2022-01-11 2023-07-25 みずほリサーチ&テクノロジーズ株式会社 LEARNING SYSTEMS, LEARNING METHODS AND LEARNING PROGRAMS


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016009569A1 (en) * 2014-07-17 2016-01-21 Necソリューションイノベータ株式会社 Attribute factor analysis method, device, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TANAKA, MASAHIRO ET AL.: "Clustering by Using Self Organizing Map", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, vol. 2, 25 February 1996 (1996-02-25), pages 301 - 304 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7472999B2 (en) 2020-10-08 2024-04-23 富士通株式会社 OUTPUT CONTROL PROGRAM, OUTPUT CONTROL METHOD, AND INFORMATION PROCESSING APPARATUS
JP7263567B1 (en) 2022-01-11 2023-04-24 みずほリサーチ&テクノロジーズ株式会社 Information selection system, information selection method and information selection program
WO2023136118A1 (en) * 2022-01-11 2023-07-20 みずほリサーチ&テクノロジーズ株式会社 Information selection system, information selection method, and information selection program
JP2023102156A (en) * 2022-01-11 2023-07-24 みずほリサーチ&テクノロジーズ株式会社 Information selection system, information selection method, and information selection program

Also Published As

Publication number Publication date
JP6863926B2 (en) 2021-04-21
JP2019191895A (en) 2019-10-31

Similar Documents

Publication Publication Date Title
WO2019207910A1 (en) Data analysis system and data analysis mehtod
CN108009643B (en) A kind of machine learning algorithm automatic selecting method and system
Zhang et al. Local density adaptive similarity measurement for spectral clustering
Liu et al. Robust graph mode seeking by graph shift
US20140149412A1 (en) Information processing apparatus, clustering method, and recording medium storing clustering program
US9208278B2 (en) Clustering using N-dimensional placement
Van Leuken et al. Selecting vantage objects for similarity indexing
JP2000311246A (en) Similar image display method and recording medium storing similar image display processing program
Richards et al. Clustering and unsupervised classification
Hamel Visualization of support vector machines with unsupervised learning
Rahman et al. A flexible multi-layer self-organizing map for generic processing of tree-structured data
Chepushtanova et al. Persistence images: An alternative persistent homology representation
KR100895261B1 (en) Inductive and Hierarchical clustering method using Equilibrium-based support vector
Zhang et al. Bow pooling: a plug-and-play unit for feature aggregation of point clouds
KR101577249B1 (en) Device and method for voronoi cell-based support clustering
CN102855624B (en) A kind of image partition method based on broad sense data fields and Ncut algorithm
JP5370267B2 (en) Image processing system
JP6663323B2 (en) Data processing method, data processing device, and program
Karpagam et al. Improved content-based classification and retrieval of images using support vector machine
Li et al. A fast color image segmentation approach using GDF with improved region-level Ncut
Stanescu et al. A comparative study of some methods for color medical images segmentation
CN117609412B (en) Spatial object association method and device based on network structure information
Meng et al. Determining the number of clusters in co-authorship networks using social network theory
Chen et al. Dynamic image segmentation algorithm in 3D descriptions of remote sensing images
CN110750661B (en) Method, device, computer equipment and storage medium for searching image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19792692

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19792692

Country of ref document: EP

Kind code of ref document: A1