WO2019207910A1 - Data analysis system and data analysis method - Google Patents

Data analysis system and data analysis method

Info

Publication number
WO2019207910A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
node
score
analysis system
data analysis
Prior art date
Application number
PCT/JP2019/005167
Other languages
English (en)
Japanese (ja)
Inventor
前川 拓也
Original Assignee
株式会社日立ソリューションズ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立ソリューションズ
Publication of WO2019207910A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Definitions

  • This disclosure relates to a data analysis system.
  • Machine learning technologies such as neural networks are attracting attention. Attempts have been made to solve various problems using a machine learning model obtained by machine learning.
  • Japanese Patent Application Laid-Open No. 2003-323601 describes a prediction device that includes: a similar case extracting unit 1 that extracts a similar case set, which is a set of cases similar to the case to be predicted, from a known case set; a certainty factor calculation unit 2 that calculates a certainty factor of a prediction attribute value from the similar case set; and a reliability measure calculation unit 3 that calculates a reliability measure of the certainty factor from the similar case set and the certainty factor. The prediction device is configured to output the certainty factor of a prediction attribute value together with the reliability measure of that certainty factor.
  • That is, the method described in Japanese Patent Application Laid-Open No. 2003-323601 attaches to the certainty of a prediction based on similar cases a reliability measure indicating how trustworthy that certainty is, and thereby supports the user's judgment of the subsequent prediction result.
  • however, with this method the user cannot know the degree to which each explanatory variable contributes to the prediction result; that is, the user cannot know what factor in the input data led to the prediction result. The user therefore uses the machine learning model while the relevance between the explanatory variables and the objective variable (the prediction result) inside the neural network remains unknown. For this reason, it is difficult for the user to know what kind of judgment should be made based on the prediction result.
  • the present invention has been made in view of this situation, and provides a technique for visualizing the degree of influence of each explanatory variable on the objective variable, making it possible to grasp what judgment should be made based on the prediction result.
  • to achieve this, the data analysis system includes an arithmetic device that executes a program and a storage device connected to the arithmetic device. The arithmetic device includes: a feature node calculation unit that divides an input data set consisting of a plurality of explanatory variables used by the machine learning model during learning, or of a data set obtained by processing the explanatory variables, according to a specified division condition, and calculates a feature node representing a feature of the distribution structure of each divided data set; a score calculation unit that generates neighborhood data of input data including the feature node and calculates a score representing the relationship between the explanatory variable and the objective variable; and an output processing unit that outputs output results including the score.
  • in this embodiment, the machine learning model has been trained in advance, and the system refers to the learning data used in that training and obtains output results using the trained machine learning model. The machine learning model returns an output signal that is a k-dimensional vector in response to an input signal that is a d-dimensional vector; the description assumes that the output signal corresponds to the classification probabilities of k classification classes.
  • FIG. 1 is a diagram showing the data analysis system configuration of this embodiment.
  • the data analysis system of this embodiment is a computer that analyzes the relationship between input data and output data in machine learning, and includes an input device 101, an output device 102, a display device 103, a processing device 104, and a storage device 111.
  • the input device 101 is a keyboard, a mouse, or the like, and is an interface that receives input from the user.
  • the output device 102 is a printer or the like, and is an interface that outputs the execution result of the program in a format that can be visually recognized by the user.
  • the display device 103 is a display device such as a liquid crystal display device, and is an interface that outputs the execution result of the program in a format that can be visually recognized by the user.
  • a terminal connected to the data analysis system via a network may provide the input device 101, the output device 102, and the display device 103.
  • the processing device 104 includes a processor (arithmetic device) that executes a program and a memory that stores the program and data. Specifically, the input processing unit 106, the feature node calculation unit 107, the score calculation unit 108, the node mapping unit 109, and the output processing unit 110 are realized by the processor executing the program. Note that a part of processing performed by the processor executing the program may be executed by another arithmetic device (for example, FPGA).
  • the memory includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element.
  • the ROM stores an immutable program (for example, BIOS).
  • the RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the program executed by the processor and the data used when the program is executed.
  • the storage device 111 is a large-capacity non-volatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD).
  • the storage device 111 stores data used by the processing device 104 when executing the program and a program executed by the processing device 104.
  • the storage device 111 stores the input data table 112, the normalization information table 113, the division condition table 114, the node information table 115, the node distance table 116, the score table 117, and the weighted average score table 118, together with the other data needed for the series of processes and the output results.
  • the program is read from the storage device 111, loaded into the memory, and executed by the processor.
  • the data analysis system may have a communication interface that controls communication with other devices according to a predetermined protocol.
  • the program executed by the processing device 104 is provided to the data analysis system via a removable medium (CD-ROM, flash memory, etc.) or a network, and is stored in the nonvolatile storage device 111 that is a non-temporary storage medium. For this reason, the data analysis system may have an interface for reading data from a removable medium.
  • the data analysis system is a computer system configured on a single physical computer or on a plurality of logically or physically configured computers, and may operate on a virtual machine constructed on a plurality of physical computer resources.
  • FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment.
  • the input data table 112 stores data obtained by processing the learning data of the machine learning model into the format used in the series of processes by the data analysis system of the present embodiment, and includes the explanatory variables 1 to d (201) and the objective variables 1 to k (202).
  • the explanatory variables 1 to d (201) represent d-dimensional vectors that are input data of the machine learning model.
  • in general, data is often normalized for each variable before learning. In that case, the normalized data is restored to the original numerical values using the normalization information table 113 and then stored.
  • for data that includes a time series, the value of a variable x at each time point can be flattened into separate variables named x_t0, x_t1, x_t2, and so on.
  • in this case, the number of dimensions of the explanatory variables 201 and the number of input dimensions of the machine learning model do not match, so the data format is converted whenever data is exchanged with the model.
  • the objective variables 1 to k (202) are k-dimensional vectors that are output results of the machine learning model.
  • the normalization information table 113 stores information related to the normalization processing performed during learning of the machine learning model, and contains the variable ID 203, the variable name 204, the data type 205, the average 206, the standard deviation 207, and the model data format correspondence information 208.
  • the variable ID 203 is an index that identifies the element of the explanatory variable 201.
  • the variable name 204 is the name of the explanatory variable.
  • the data type 205 is the data type of the explanatory variable (for example, logical type, integer type, floating point type, etc.).
  • the average 206 and standard deviation 207 store the average and standard deviation used in the normalization process when learning the machine learning model. For variables that are not normalized, such as logical variables, values such as an average of 0 and a standard deviation of 1 are set.
  • the model data format correspondence information 208 stores information for mutual conversion when the input format of the machine learning model differs from the input format handled by the data analysis system of this embodiment. For example, for data including a time series, the variable x is expanded into x_t0, x_t1, and so on, so mutual conversion is possible by describing the correspondence between the indexes before and after the expansion (see the sketch below).
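  • As an illustration, the following is a minimal Python sketch of this round-trip conversion, assuming a hypothetical two-variable normalization information table and z-score normalization; the table layout, column names, and function names are illustrative, not the patent's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical normalization information table (cf. table 113):
# one row per explanatory variable, with the mean/std used at training time.
norm_info = pd.DataFrame({
    "variable_id": [0, 1],
    "variable_name": ["age", "logins_per_week"],
    "mean": [42.0, 3.5],          # average 206
    "std": [11.0, 2.0],           # standard deviation 207
})

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the z-score normalization used when the model was trained."""
    out = df.copy()
    for _, row in norm_info.iterrows():
        name = row["variable_name"]
        out[name] = (df[name] - row["mean"]) / row["std"]
    return out

def denormalize(df: pd.DataFrame) -> pd.DataFrame:
    """Restore original values, as done when filling the input data table 112."""
    out = df.copy()
    for _, row in norm_info.iterrows():
        name = row["variable_name"]
        out[name] = df[name] * row["std"] + row["mean"]
    return out

def flatten_time_series(series: np.ndarray, name: str) -> dict:
    """Expand a time series x into flat variables x_t0, x_t1, ... (cf. 208)."""
    return {f"{name}_t{t}": v for t, v in enumerate(series)}

raw = pd.DataFrame({"age": [30.0, 55.0], "logins_per_week": [1.0, 6.0]})
assert np.allclose(denormalize(normalize(raw)), raw)   # round-trip holds
print(flatten_time_series(np.array([0.2, 0.4, 0.1]), "x"))
# {'x_t0': 0.2, 'x_t1': 0.4, 'x_t2': 0.1}
```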
  • the division condition table 114 stores the conditions used by the feature node calculation unit 107 to divide the input data table 112, and includes the condition ID 209, the division condition 210, the number of data 211, the map size 212, and the aggregation flags 1 to k (213).
  • the condition ID 209 is identification information for identifying a condition recorded in the division condition table 114.
  • the division condition 210 is a condition for dividing the input data to obtain one data set. For example, a character string such as an SQL select statement may be used.
  • the division condition 210 may describe a specific value or range for the explanatory variable, or a combination of values for the objective variable.
  • the number of data 211 is the number of data in the input data selected according to the division condition.
  • the map size 212 stores the map size used by the feature node calculation unit 107 in the node vector calculation step 403 of FIG. 4.
  • the value of the map size 212 may be set to NULL or the like, and the result of the automatic setting may be stored.
  • Aggregation flags 1 to k (213) are k flag arrays for the objective variables, used in the objective score aggregation process 504 in FIG. 5. Only the scores for objective variables whose flag in this array is set to 1 are aggregated to obtain the objective score. For example, in an analysis aiming at rank-up from the current rank in a member management system, each current rank is set as a division condition, and the flags of the objective variables corresponding to predicted ranks higher than the current rank are set to 1 (see the sketch below).
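  • The following is a minimal sketch of one division condition row and the selection of its data subset, assuming a pandas-style query string as the division condition 210; the table contents and column names are illustrative only.

```python
import pandas as pd

# Hypothetical input data table 112 with a "rank" explanatory variable and
# k = 3 objective variables (predicted probability of each rank next period).
input_data = pd.DataFrame({
    "rank": ["bronze", "silver", "bronze", "gold"],
    "obj_bronze": [0.7, 0.1, 0.5, 0.0],
    "obj_silver": [0.2, 0.6, 0.4, 0.2],
    "obj_gold":   [0.1, 0.3, 0.1, 0.8],
})

# One row of the division condition table 114 (values are illustrative):
# flags select the "higher rank" objectives for a current rank of bronze.
condition = {
    "condition_id": 1,
    "division_condition": "rank == 'bronze'",   # division condition 210
    "aggregation_flags": {"obj_bronze": 0, "obj_silver": 1, "obj_gold": 1},
}

# Select the data set for this condition (cf. data division process 402)
# and record the number of data 211.
subset = input_data.query(condition["division_condition"])
condition["n_data"] = len(subset)
print(condition["n_data"])   # 2 -- members currently at rank "bronze"
```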
  • the node information table 115 stores the result of the feature node calculation by the feature node calculation unit 107, and includes the condition ID 214, the node ID 215, the hit number 216, the hit rate 217, the coordinates 218, the explanatory variables 1 to d (219), and the objective variables 1 to k (220).
  • the condition ID 214 is identification information (condition ID 209) for identifying a condition recorded in the division condition table 114.
  • the node ID 215 is node identification information that satisfies the condition specified by the condition ID 214.
  • the number of hits 216 is the number of data items, in the data set selected by the division condition, that are closer to the node identified by the node ID 215 than to any other node.
  • the hit rate 217 is a value obtained by dividing the hit number 216 by the data number 211.
  • the coordinates 218 are the processing result of the node mapping process 304 shown in FIG. 6.
  • the explanatory variables 1 to d (219) and the objective variables 1 to k (220) are data in the same format as the input data table 112, and form the node vectors representing the characteristics of the distribution structure of the input data set. These vectors do not have to match the data included in the input data table and do not have to follow the type specified in the data type 205; for example, even if a logical or integer type is specified, the values can be stored as floating-point data.
  • the node distance table 116 stores the distance between nodes for the explanatory variables 219 of the node information table 115, or for the node vectors obtained by appending the objective variables 220 to the explanatory variables 219, and includes the node from 221, the node to 222, and the distance 223.
  • the node from 221 and the node to 222 are identification information for specifying the nodes recorded in the node information table 115, respectively.
  • the values of the node from 221 and the node to 222 may be a set of the condition ID 214 and the node ID 215, or may be an index on the node information table 115.
  • the distance 223 is the distance between the node vectors of the node from 221 and the node to 222.
  • the node distance table 116 may instead be expressed as a two-dimensional array whose rows and columns are indexed by the node information table 115.
  • the score table 117 stores the calculation result of the score calculation unit 108, and includes data of the objective variable ID 224, the condition ID 225, the node ID 226, and the score 227 of the explanatory variables 1 to d.
  • the objective variable ID 224 stores the element number of the k-dimensional vector in the output result of the machine learning model.
  • the condition ID 225 and the node ID 226 are identification information for specifying the node recorded in the node information table 115, and use values common to the condition ID 214 and the node ID 215 of the node information table 115.
  • the score 227 of the explanatory variables 1 to d is a calculation result of the score calculation unit 108, and is stored for each objective variable ID 224, condition ID 225, node ID 226, and explanatory variable.
  • the score table 117 also stores the objective score and the weighted average score described with reference to FIG. 5. An objective score row is marked by setting the objective variable ID 224 to -1 or the like, and a weighted average score row is marked by setting both the objective variable ID 224 and the node ID 226 to -1 or the like, indicating that the row does not identify a specific objective variable or node.
  • the weighted average score table 118 stores, for each division condition, the score of each explanatory variable. Specifically, the weighted average score table 118 groups the weighted average scores calculated in step 505 of the score calculation process 303 (FIG. 5), described later, by division condition, and lists the score of each explanatory variable together with its variable name, sorted in descending order of absolute value. The weighted average score table 118 is generated in step 701 of the output process 305 (FIG. 7).
  • the weighted average score table 118 allows the user to easily grasp, for each target layer represented by a division condition, the explanatory variables with a high degree of influence, and to compare the rank and score of an explanatory variable across division conditions. For example, under conditions 1 and 2 the influence of attribute A may be large, while under conditions 3 and 4 the influence of attribute J may be large. If the score of attribute I has the opposite sign under two conditions, applying the same measure may produce the opposite effect. In this way, the table can be used for planning measures for the target layer indicated by each condition.
  • FIG. 3 is a flowchart of the overall processing of this embodiment.
  • the input processing unit 106 executes input processing (301).
  • for example, the input processing unit 106 refers to the normalization information table 113, converts the learning data of the machine learning model from the model's input format to the input format of the present embodiment, restores the normalized numerical values to their original values, and stores the result in the input data table 112.
  • the feature node calculation unit 107 executes feature node calculation processing (302). For example, the feature node calculation unit 107 divides the input data table 112 according to the division condition table 114, calculates a feature node from each divided data set, and stores the result in the node information table. Details of the feature node calculation process will be described with reference to FIG.
  • the score calculation unit 108 executes a score calculation process (303). For example, the score calculation unit 108 calculates a score representing the degree of influence of the explanatory variable, and stores the result in the score table. Details of the score calculation process will be described with reference to FIG.
  • the node mapping unit 109 executes a node mapping process (304). For example, the node mapping unit 109 maps the feature node obtained in step 302 to the low-dimensional space. Details of the node mapping process will be described with reference to FIG.
  • the output processing unit 110 executes the output process (305) and ends the process. Details of the output process will be described with reference to FIG.
  • FIG. 4 is a flowchart of the feature node calculation process 302 of this embodiment.
  • the feature node calculation unit 107 loops the variable p from 1 to the number of data in the division condition table 114 (401). Thereafter, the processing from step 402 to step 405 is executed for the p-th division condition.
  • the feature node calculation unit 107 performs data division processing (402). For example, data that satisfies the division condition 210 of the p-th division condition is selected from the input data table 112. The selected data set is subjected to normalization processing using a normalization information table.
  • the feature node calculation unit 107 calculates node vectors (403). For example, node vectors representing the features of the distribution structure of the selected data set are calculated with a smaller number of nodes, using a clustering method such as the k-means method.
  • in this embodiment, a self-organizing map (hereinafter abbreviated as SOM) is applied.
  • an SOM is a kind of neural network consisting of nodes arranged in a grid, with edges connecting adjacent nodes. Each node is assigned a reference vector in the same format as the input data.
  • for each item of learning data, the node whose reference vector is closest to the item (hereinafter referred to as the BMU, Best Matching Unit) is found, and the reference vectors of the BMU and of the nodes connected to it are updated so as to approximate the learning data. Since the SOM is a known technique, a detailed description of the technique is omitted. By repeating this process, the complicated distribution structure of the learning data can be mapped onto the geometric structure of the nodes.
  • the reference vector of each node calculated as a result of the SOM is stored in the node information table 115 as the explanatory variables 219 and the objective variables 220.
  • the learning data used when executing the SOM may consist of the explanatory variables only, or of the explanatory variables combined with the objective variables; which format to use is preferably set in advance. The explanatory variables 219 and the objective variables 220 stored as the output result follow the input format of this learning data.
  • the feature node calculation unit 107 counts the number of hits for each node (404).
  • for each node, the number of data items in the selected data set whose BMU is that node is calculated as the value of the hit number 216.
  • the hit rate 217 is calculated by dividing this by the number of selected data items.
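  • As an illustration of steps 403 and 404, the following is a minimal from-scratch SOM sketch with a decaying learning rate and a Gaussian neighborhood; these schedules and the random data are simplifying assumptions, not the settings of the patent.

```python
import numpy as np

def train_som(data, m, n, iters=2000, seed=0):
    """Fit an m x n SOM to `data` (rows = normalized input vectors)."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    nodes = rng.normal(size=(m * n, d))                  # reference vectors
    grid = np.array([(i, j) for i in range(m) for j in range(n)], float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((nodes - x) ** 2).sum(axis=1))  # Best Matching Unit
        lr = 0.5 * (1.0 - t / iters)                     # decaying learning rate
        sigma = max(1.0, (m + n) / 4 * (1.0 - t / iters))
        # Gaussian neighborhood: pull the BMU and grid-nearby nodes toward x.
        g = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        nodes += lr * g[:, None] * (x - nodes)
    return nodes

def count_hits(data, nodes):
    """Step 404: count data items whose BMU is each node (hit number 216)."""
    bmus = np.argmin(((data[:, None, :] - nodes[None, :, :]) ** 2).sum(-1), axis=1)
    hits = np.bincount(bmus, minlength=len(nodes))
    return hits, hits / len(data)                        # hit number, hit rate 217

data = np.random.default_rng(1).normal(size=(300, 4))
nodes = train_som(data, m=5, n=5)
hits, hit_rate = count_hits(data, nodes)
```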
  • the feature node calculation unit 107 stores the calculated result in the data storage area (405). At this time, since the node vectors calculated in step 403 are normalized, the normalization information table 113 is used to restore the original values before the result is stored.
  • when the loop from step 401 to step 405 ends, the feature node calculation process 302 ends.
  • FIG. 5 is a flowchart of the score calculation process 303 of this embodiment.
  • the score calculation unit 108 loops the variable i from 1 to the number of data items in the node information table 115 (501). Thereafter, the processing from step 502 to step 504 is executed for the i-th node.
  • the score calculation unit 108 generates a neighborhood data set of the node i and a prediction result of the machine learning model corresponding thereto (502).
  • the neighborhood data is vector data located around the d-dimensional vector represented by the explanatory variable of the node specified by the variable i.
  • as a method for generating the neighborhood data, random numbers are drawn from a normal distribution whose mean is the value of the explanatory variable of the node i and whose standard deviation is half the standard deviation recorded in the normalization information.
  • the number of data items in the neighborhood data set may be designated in advance. Prediction using a machine learning model can be executed by performing normalization using a normalization information table and conversion using model data format correspondence information 208.
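  • A minimal sketch of this neighborhood generation, assuming the sampling distribution just described; the model call at the end is a hypothetical placeholder, not an API from the patent.

```python
import numpy as np

def make_neighborhood(node_vec, norm_std, n_samples=500, seed=0):
    """Step 502 sketch: sample points around a feature node.

    node_vec -- the node's explanatory variable values (mean of the samples)
    norm_std -- per-variable standard deviation from the normalization
                information table 113; half of it is used as the sampling std
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=node_vec,
                      scale=0.5 * np.asarray(norm_std),
                      size=(n_samples, len(node_vec)))

node_vec = np.array([0.1, -0.3, 1.2])
neighborhood = make_neighborhood(node_vec, norm_std=[1.0, 1.0, 1.0])
# predictions = model.predict(neighborhood)   # hypothetical trained model
```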
  • the score calculation unit 108 performs local model estimation processing on the generated neighborhood data set and the prediction result of the machine learning model (503).
  • in step 503, a score representing the relationship between the explanatory variables and the objective variables is obtained for the neighborhood data. In this embodiment, a linear model Y = S_1 x_1 + S_2 x_2 + ... + S_d x_d + C is estimated from the neighborhood data and the corresponding predictions, where x_i is the i-th explanatory variable and Y, S_i, and C are k-dimensional vectors; the coefficients S_i give the score of each explanatory variable for each objective variable. Since the linear model estimation technique is a known technique, a detailed description of the technique is omitted.
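  • The following is a minimal least-squares sketch of such a local model estimation; the use of numpy.linalg.lstsq is an assumption for illustration, since the patent leaves the estimation technique open.

```python
import numpy as np

def local_linear_scores(X, Y):
    """Step 503 sketch: least-squares fit of Y ~ X @ S + c.

    X -- neighborhood data, shape (n, d)
    Y -- machine learning model outputs for X, shape (n, k)
    Returns S (d x k matrix: row i is the k-dimensional score S_i of
    explanatory variable i) and the k-dimensional intercept c.
    """
    A = np.hstack([X, np.ones((len(X), 1))])      # append intercept column
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef[:-1], coef[-1]                    # S, c

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_S = np.array([[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]])
Y = X @ true_S + 0.1 * rng.normal(size=(500, 2))
S, c = local_linear_scores(X, Y)                  # S recovers true_S closely
```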
  • the score calculation unit 108 calculates the objective score by aggregating the scores obtained in step 503 according to the aggregation flags 213 (504). Specifically, the scores of the elements whose flag is 1 are summed for each explanatory variable.
  • the score calculation unit 108 calculates the weighted average score by weighting the objective scores by the hit rate 217 (505). The weighted average score is calculated for each explanatory variable over all nodes having the same condition ID (see the sketch below).
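  • A compact sketch of steps 504 and 505, assuming the array layout shown in the comments; the shapes, flag values, and uniform hit rate are illustrative only.

```python
import numpy as np

# Scores from step 503 for the nodes of one division condition:
# scores[node, variable, objective] -- cf. score 227 in the score table 117.
scores = np.random.default_rng(0).normal(size=(25, 4, 3))
agg_flags = np.array([0, 1, 1])           # aggregation flags 213 (k = 3)
hit_rate = np.full(25, 1 / 25)            # hit rate 217 per node

# Step 504: objective score = sum of scores over objectives flagged with 1.
objective_score = scores[:, :, agg_flags == 1].sum(axis=2)   # (nodes, vars)

# Step 505: weighted average score over all nodes of the condition,
# weighted by each node's hit rate.
weighted_avg = (objective_score * hit_rate[:, None]).sum(axis=0)  # (vars,)

# Ranking for the weighted average score table 118: descending |score|.
order = np.argsort(-np.abs(weighted_avg))
```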
  • the score calculation unit 108 stores the calculated result in the data storage area (506), and ends the score calculation process.
  • FIG. 6 is a flowchart of the node mapping process 304 of this embodiment.
  • in this embodiment, a multidimensional scaling method (hereinafter abbreviated as MDS) is applied.
  • MDS is a technique for mapping nodes in a multidimensional vector space to a low-dimensional space such as 2D or 3D, and performs the mapping so that the distances between nodes are reproduced as faithfully as possible. Since MDS is a known technique, a detailed description of the technique is omitted. In this embodiment, when MDS is applied, the coordinates are initialized in consideration of the geometric structure of the SOM nodes.
  • the node mapping unit 109 generates a node distance table 116 (601).
  • for example, each feature node vector is set to the normalized explanatory variables 219, and the distance table is generated using the Euclidean distance.
  • the node mapping unit 109 initializes each variable (602). Specifically, first, lt, lb, rt, and rb are respectively defined as upper left, lower left, upper right, and lower right node indexes in the lattice-like SOM node structure, and all are set to -1. Next, y is set to 0. Next, the array Pos is defined as an array for storing the coordinates of each node. Then, Sw and Sh are defined as node coordinate arrays in the x direction and the y direction, respectively. This array size is determined by the map size 212. All elements of Pos, Sw, and Sh are initialized with 0.
  • the node mapping unit 109 loops the variable p from 1 to the number of data items in the division condition table 114 (603). Thereafter, the processing from step 604 to step 609 is executed for the p-th division condition.
  • the node mapping unit 109 sets y to the maximum value in the array Sh plus a predetermined number (for example, 2) (605).
  • the predetermined number may be changed to an appropriate value.
  • otherwise, the node mapping unit 109 proceeds to step 606 without doing anything.
  • the indexes of the four corner nodes for the nodes of the division condition p are set to lt, lb, rt, and rb, respectively (606). At this time, lt can be set to rb + 1 and the remaining variables can be set according to the map size 212.
  • the node mapping unit 109 sets values for Sw and Sh (607). In this embodiment, a value obtained by equally dividing the distance between the nodes lt and rt and the distance between lt and lb according to the map size is set.
  • the node mapping unit 109 adds y to each element of Sh (608). If it is desired to move in the x-axis direction, a variable x may be defined and the same processing as y may be applied to Sw.
  • the node mapping unit 109 sets the coordinates of the nodes lt to rb to Pos (609).
  • the coordinates of the node at row i and column j in the SOM node structure may be set to (Sw[i], Sh[j]).
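  • As an illustration of the distance table generation (601) and the subsequent MDS embedding, the following sketch uses scipy and scikit-learn and replaces steps 602 to 609 with a simplified grid-based initialization; this simplification and the random node vectors are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.manifold import MDS

# Feature node vectors (normalized explanatory variables 219) for one
# division condition with a 5 x 5 SOM map (as produced in step 403).
m, n = 5, 5
nodes = np.random.default_rng(0).normal(size=(m * n, 4))

# Step 601: node distance table as Euclidean distances between node vectors.
dist = cdist(nodes, nodes)

# Simplified stand-in for steps 602-609: initial coordinates laid out on the
# SOM grid (the patent spaces rows/columns by corner-node distances instead).
init = np.array([(j, i) for i in range(m) for j in range(n)], dtype=float)

# Apply MDS from that initialization so the embedding keeps the grid structure.
mds = MDS(n_components=2, dissimilarity="precomputed", n_init=1, random_state=0)
coords = mds.fit_transform(dist, init=init)       # coordinates 218 per node
```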
  • the node mapping unit 109 stores the result in the storage area (611), and ends the node mapping process.
  • FIG. 7 is a flowchart of the output process 305 of this embodiment.
  • the output processing unit 110 lists the weighted average scores and generates the weighted average score table 118 (701).
  • the weighted average score table 118 divides the weighted average score for each division condition, sorts the scores of each explanatory variable in descending order of absolute values, and lists the variable names.
  • the output processing unit 110 displays a component map of the node vector (702).
  • the component map is obtained by visualizing the value of a specific explanatory variable 219 or objective variable 220 of each node under the same condition, using the geometric structure of the SOM nodes and the map size. For example, in the component map of explanatory variable i when the map size is m × n, the values of explanatory variable i in all the nodes having the same condition ID in the node information table are displayed as an m × n image in colors corresponding to the values.
  • the component map of the present embodiment images the value of the explanatory variable 219, based on the geometric structure of the nodes, for each explanatory variable with respect to a specific division condition.
  • a component map using the objective variable 220 can also be displayed. With the component map, the correlation between each explanatory variable and the correlation between the explanatory variable and the objective variable can be visually grasped.
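  • A minimal matplotlib sketch of such a component map, assuming illustrative node vectors on a 5 × 5 map; the colormap choice is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Node vectors of one division condition on an m x n SOM (illustrative data).
m, n, d = 5, 5, 4
nodes = np.random.default_rng(0).normal(size=(m * n, d))

# Component map of explanatory variable i: reshape that component of every
# node to the m x n grid and show it as a colored image (step 702).
i = 2
plt.imshow(nodes[:, i].reshape(m, n), cmap="viridis")
plt.colorbar(label=f"value of explanatory variable {i}")
plt.title(f"Component map (variable {i})")
plt.show()
```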
  • the output processing unit 110 displays a hit map (703).
  • the hit map is obtained by visualizing the hit number 216 (or its logarithm) or the hit rate 217 using the visualization method of step 702.
  • in this embodiment, the number of hits is imaged by color coding based on the logarithm of the hit rate 217, as illustrated in the figure. Further, as shown in the figure, the numerical value of the hit number may be displayed. With the hit map, it is possible to grasp the dense nodes in the learning data distribution.
  • the output processing unit 110 displays a score map (704).
  • the score map is obtained by visualizing the score 227 or the objective score for a specific explanatory variable using the visualization method in step 702.
  • in this embodiment, each explanatory variable is imaged by color coding based on its score 227, as illustrated in the figure. For example, a score of 0 is colored green, and the color gradually changes toward red in the positive direction and toward blue in the negative direction, so that nodes with a high degree of influence are easy to grasp. Further, as shown in the figure, by comparing the pattern with the component map of the corresponding explanatory variable, it is possible to grasp the values taken by the explanatory variable at nodes with a high degree of influence (see the sketch below).
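  • A minimal sketch of this diverging color scale in matplotlib, assuming illustrative scores; TwoSlopeNorm centers the scale at a score of 0, matching the green-at-zero description.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, TwoSlopeNorm

# Scores 227 of one explanatory variable for every node of an m x n map.
m, n = 5, 5
score = np.random.default_rng(0).normal(size=(m * n,))

# Diverging color scale centered on 0: green at 0, shading to red for
# positive scores and to blue for negative scores.
cmap = LinearSegmentedColormap.from_list("score", ["blue", "green", "red"])
norm = TwoSlopeNorm(vmin=score.min(), vcenter=0.0, vmax=score.max())
plt.imshow(score.reshape(m, n), cmap=cmap, norm=norm)
plt.colorbar(label="score")
plt.title("Score map")
plt.show()
```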
  • the output processing unit 110 displays a node map (705).
  • the node map is obtained by visualizing each node as a point in a low-dimensional space based on the coordinates 218 for each node calculated in step 304.
  • the shape and color of a point representing each node may be set according to the value of the explanatory variable in the node information table, the value of the objective variable, the score for each explanatory variable in the score table, the objective score, the division condition, and the like.
  • the node map of this embodiment is obtained by plotting nodes in a two-dimensional space based on the node coordinates 218 under each division condition, as illustrated in FIG. Further, the geometric structure of the node under a specific division condition may be displayed by a grid line.
  • with the node map, the positional relationship of the nodes can be grasped across a plurality of division conditions. For example, when the current rank is used as the division condition, looking at nodes that are close to each other makes it easy to find nodes that are likely to rise in rank or that are at high risk of falling. Differences in features from such neighboring nodes can be compared directly using the values in the node information table, or compared using the component map.
  • as described above, the data analysis system includes: the feature node calculation unit 107, which divides an input data set consisting of a plurality of explanatory variables used by the machine learning model at the time of learning, or of a data set obtained by processing the explanatory variables, according to a specified division condition, and calculates a feature node representing a feature of the distribution structure of each divided data set; the score calculation unit 108, which generates neighborhood data of input data including the feature node and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between the explanatory variable and the objective variable; and the output processing unit 110, which outputs the output results including the score.
  • the feature nodes representing the features of the distribution structure can represent the features of the data set with less data than the learning data, and the neighborhood data complements the feature nodes in representing those features. That is, it is possible to represent the characteristics of the data set with a small amount of data and to reduce the amount of calculation.
  • the feature node calculation unit 107 calculates the feature node based on the input data set to which the self-organizing map is applied, so that the feature node can be accurately calculated.
  • the feature node calculation unit 107 calculates the feature nodes using an input data set consisting of a plurality of explanatory variables used by the machine learning model during learning and the objective variables calculated by the machine learning model, or of a data set obtained by processing the explanatory variables and the objective variables, so the objective variables can be compared on a map.
  • the feature node calculation unit 107 divides the input data set according to a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value (for example, the maximum value) or range of an element of the objective variable, or according to a division condition expressed by a combination of these, so analysis with the target layer narrowed down can be performed. That is, rather than the entire population, the data of a group having a specific attribute can be analyzed, changing the attribute according to the purpose.
  • the score calculation unit 108 calculates a score corresponding to the format of the objective variable for each explanatory variable by applying linear model estimation based on the data of the explanatory variable and the data of the objective variable. Because the linear model is simple and easy to handle, it is easy for the user to understand and can improve the reliability of the results. In particular, in a linear model, when a plurality of attributes are integrated, calculation can be performed with the sum of probabilities, so that the user can easily understand intuitively.
  • the score calculation unit 108 calculates the objective score by summing, among the elements of the objective variable, those specified for each division condition, so analysis with the target layer narrowed down can be performed. That is, rather than the entire population, the data of a group having a specific attribute can be analyzed, changing the attribute according to the purpose.
  • the score calculation unit 108 calculates, from the calculated scores and the calculated objective scores, a weighted average score for each explanatory variable based on the number of neighboring data items (the hit rate) of each feature node under each division condition, so the data distribution is taken into account and the characteristics of the data can be expressed correctly.
  • the data analysis system includes the node mapping unit 109, which maps the feature nodes calculated by the feature node calculation unit 107 into a two-dimensional space for each division condition, so the characteristics of each group can be expressed in an easily understandable form.
  • the node mapping unit 109 generates data for displaying, as images based on the geometric structure of the nodes, the feature node values for each explanatory variable, the calculated scores, and the weighted average scores calculated from the scores and the objective scores, so data can be compared between groups of different attributes while the distance relationships between nodes are maintained.
  • the node mapping unit 109 initializes the feature node vectors, or the feature node vectors including the objective variable components, based on the geometric structure of the feature nodes of each division condition, and then applies the multidimensional scaling method, so attributes with high influence and attributes with low influence can be expressed in an easily understandable manner together with the score map.
  • when the input data set is time-series data including explanatory variables for each predetermined time, the data obtained by expanding the explanatory variables from a certain point in the past to the present time into independent variables is used as input data, and the rules used for the expansion are stored, so the input data set can be analyzed as time-series data.
  • the present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the spirit of the appended claims.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment may be replaced with the configuration of another embodiment.
  • another configuration may be added, deleted, or replaced.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized partly or wholly in hardware, for example by designing an integrated circuit, or may be realized in software by a processor interpreting and executing a program that implements each function.
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.
  • control lines and information lines indicate what is considered necessary for explanation, and do not necessarily indicate all the control lines and information lines necessary for implementation. In practice, almost all the components can be considered to be connected to each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a data analysis system comprising an arithmetic device that executes a program and a storage device connected to the arithmetic device. The arithmetic device comprises: a feature node calculation unit that divides an input data set comprising a plurality of explanatory variables used during learning by a machine learning model, or an input data set comprising a data set obtained by processing the explanatory variables, according to a designated division condition, and calculates a feature node representing the feature of the distribution structure of each of the divided data sets; a score calculation unit that generates neighborhood data of input data including the feature node and calculates, on the basis of the explanatory variables of the generated neighborhood data and the data of an objective variable obtained by inputting the neighborhood data into the machine learning model, a score representing the relationship between the explanatory variable and the objective variable; and an output processing unit that outputs an output result including the score.
PCT/JP2019/005167 2018-04-24 2019-02-13 Data analysis system and data analysis method WO2019207910A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018083408A JP6863926B2 (ja) 2018-04-24 2018-04-24 データ分析システム及びデータ分析方法
JP2018-083408 2018-04-24

Publications (1)

Publication Number Publication Date
WO2019207910A1 true WO2019207910A1 (fr) 2019-10-31

Family

ID=68295150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/005167 WO2019207910A1 (fr) 2019-02-13 Data analysis system and data analysis method

Country Status (2)

Country Link
JP (1) JP6863926B2 (fr)
WO (1) WO2019207910A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7263567B1 (ja) 2022-01-11 2023-04-24 みずほリサーチ&テクノロジーズ株式会社 情報選択システム、情報選択方法及び情報選択プログラム
JP7472999B2 (ja) 2020-10-08 2024-04-23 富士通株式会社 出力制御プログラム、出力制御方法および情報処理装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7353952B2 (ja) * 2019-12-09 2023-10-02 株式会社日立製作所 分析システムおよび分析方法
JP7499597B2 (ja) 2020-04-16 2024-06-14 株式会社日立製作所 学習モデル構築システムおよびその方法
CN113344214B (zh) * 2021-05-31 2022-06-14 北京百度网讯科技有限公司 数据处理模型的训练方法、装置、电子设备及存储介质
JP7314328B1 (ja) 2022-01-11 2023-07-25 みずほリサーチ&テクノロジーズ株式会社 学習システム、学習方法及び学習プログラム

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016009569A1 (fr) * 2014-07-17 2016-01-21 Necソリューションイノベータ株式会社 Procédé, dispositif et programme d'analyse de facteur d'attribut

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016009569A1 (fr) * 2014-07-17 2016-01-21 Necソリューションイノベータ株式会社 Procédé, dispositif et programme d'analyse de facteur d'attribut

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TANAKA, MASAHIRO ET AL.: "Clustering by Using Self Organizing Map", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, vol. 2, 25 February 1996 (1996-02-25), pages 301 - 304 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7472999B2 (ja) 2020-10-08 2024-04-23 富士通株式会社 出力制御プログラム、出力制御方法および情報処理装置
JP7263567B1 (ja) 2022-01-11 2023-04-24 みずほリサーチ&テクノロジーズ株式会社 情報選択システム、情報選択方法及び情報選択プログラム
WO2023136118A1 (fr) * 2022-01-11 2023-07-20 みずほリサーチ&テクノロジーズ株式会社 Système de sélection d'informations, procédé de sélection d'informations et programme de sélection d'informations
JP2023102156A (ja) * 2022-01-11 2023-07-24 みずほリサーチ&テクノロジーズ株式会社 情報選択システム、情報選択方法及び情報選択プログラム

Also Published As

Publication number Publication date
JP6863926B2 (ja) 2021-04-21
JP2019191895A (ja) 2019-10-31

Similar Documents

Publication Publication Date Title
WO2019207910A1 (fr) Data analysis system and data analysis method
Liu et al. Robust graph mode seeking by graph shift
US9519660B2 (en) Information processing apparatus, clustering method, and recording medium storing clustering program
CN108009643A (zh) 一种机器学习算法自动选择方法和系统
JP6888484B2 (ja) 検索プログラム、検索方法、及び、検索プログラムが動作する情報処理装置
US9208278B2 (en) Clustering using N-dimensional placement
Wang et al. Enhancing minimum spanning tree-based clustering by removing density-based outliers
JP2000311246A (ja) 類似画像表示方法及び類似画像表示処理プログラムを格納した記録媒体
Srivastava et al. Deeppoint3d: Learning discriminative local descriptors using deep metric learning on 3d point clouds
Hamel Visualization of support vector machines with unsupervised learning
Zhang et al. Bow pooling: a plug-and-play unit for feature aggregation of point clouds
Zhu et al. Multi‐image matching for object recognition
JP5370267B2 (ja) 画像処理システム
Lim et al. A fuzzy qualitative approach for scene classification
Taşkın et al. An adaptive affinity matrix optimization for locality preserving projection via heuristic methods for hyperspectral image analysis
KR100895261B1 (ko) 평형기반 서포트 벡터를 이용한 귀납적이고 계층적인군집화 방법
KR101577249B1 (ko) 보로노이 셀 기반의 서포트 클러스터링 장치 및 방법
Wang et al. An efficient k-medoids clustering algorithm for large scale data
CN102855624B (zh) 一种基于广义数据场和Ncut算法的图像分割方法
JP6663323B2 (ja) データ処理方法、データ処理装置、及びプログラム
Li et al. A Fast Color Image Segmentation Approach Using GDF with Improved Region‐Level Ncut
Stanescu et al. A comparative study of some methods for color medical images segmentation
Meng et al. Determining the number of clusters in co-authorship networks using social network theory
Recaido et al. Visual Explainable Machine Learning for High-Stakes Decision-Making with Worst Case Estimates
CN117609412B (zh) 一种基于网络结构信息的空间对象关联方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19792692

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19792692

Country of ref document: EP

Kind code of ref document: A1