WO2023063485A1

WO2023063485A1 - Data visualization method and device therefor

Info

Publication number: WO2023063485A1
Application number: PCT/KR2021/017808
Authority: WO
Inventors: 최유리; 피에 로말리자장
Original assignee: 주식회사 솔리드웨어
Priority date: 2021-10-14
Filing date: 2021-11-30
Publication date: 2023-04-20
Also published as: KR20230053384A

Abstract

A data visualization method and a device therefor are disclosed. The data visualization device groups, into a plurality of clusters, a plurality of data samples having variable values for a plurality of variables, identifies, among the plurality of variables, main variables indicating the difference between the plurality of clusters, extracts a certain number of data samples for each cluster, the extraction allowing data samples including minimum, maximum, or average variable values for the main variables to be included, and performs visualization on the basis of the main variables and the extracted data samples.

Description

Data visualization method and device therefor

Embodiments of the present invention relate to a method and apparatus for visualizing data, and more particularly, to a method and apparatus for visualizing a result of data clustering.

Unsupervised learning is often used to understand and model unlabeled data. For example, customer information in a marketing database, a large-scale public survey, or a large-scale chemical compound library test result may be classified into a plurality of clusters using an unsupervised learning model. However, most such data has the limitation that it contains no information that can guide the training of an unsupervised learning model.

Existing methods for directly evaluating the results of a trained model, such as the Calinski-Harabasz index or the silhouette index, rely on geometrical properties of the resulting clusters and the underlying data. Such evaluation has the disadvantage that it may not fit the analysis purpose of the model well. Therefore, additional information may be needed to understand the clustering results.

A common way to provide additional information is through visualizations such as diagrams and pictures. Appropriate visualization can help (1) understand data structure, (2) interpret clustering results, (3) compare between clusters, and (4) detect outliers in data.

Examples of visualization methods for multidimensional data include line plots and scatter plots, projections, Chernoff faces, star and radar plots, correlation plots, matrix plots, parallel coordinates, and heat maps. However, when there is no target variable, these methods offer no guidance on which information is important to visualize, which increases uncertainty. Also, visualizing high-dimensional data, i.e., data with hundreds of variables, is often counterproductive and hinders a good understanding of the results. Many visualization methods also limit the number of variables that can be displayed simultaneously. Representing complex interactions between variables requires representing the relationships between them, but finding such interactions is computationally difficult because the number of interactions is significantly greater than the number of variables. In addition, when users manually select the variables and the visualization method, accurate decisions are difficult and errors are likely when the data is not well known.

A technical problem to be solved by an embodiment of the present invention is to provide a method and apparatus for efficient visualization that extract main variables and main samples so that users can better understand the results of data clustering.

An example of a data visualization method according to an embodiment of the present invention for solving the above technical problem is a method performed by a data visualization device, the method including: clustering a plurality of data samples composed of variable values for a plurality of variables into a plurality of clusters; identifying a main variable representing a difference between the plurality of clusters among the plurality of variables; extracting a certain number of data samples for each cluster so that data samples including minimum, maximum, or average variable values for the main variable are included; and visualizing based on the main variable and the extracted data samples.

An example of a data visualization device according to an embodiment of the present invention for solving the above technical problem includes: a clustering unit for clustering a plurality of data samples composed of variable values for a plurality of variables into a plurality of clusters; a variable selection unit for determining a main variable representing a difference between the plurality of clusters among the plurality of variables; a sample selection unit for extracting, for each cluster, a certain number of data samples so that data samples including minimum, maximum, or average variable values for the main variable are included; and a visualization unit for visualizing based on the main variable and the extracted data samples.

According to an embodiment of the present invention, it is possible to provide efficient data visualization by selecting key variables and key data instead of all data.

FIGS. 1 to 3 are views showing examples of plots used for data visualization according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating an example of a data visualization method according to an embodiment of the present invention;

FIG. 5 is a diagram showing an example of a data sample according to an embodiment of the present invention;

FIG. 6 is a diagram showing an example of clustering according to an embodiment of the present invention;

FIGS. 7 and 8 are diagrams showing an example of a method for identifying main variables according to an embodiment of the present invention;

FIG. 9 is a diagram showing an example of another method of extracting a main variable according to an embodiment of the present invention;

FIG. 10 is a diagram showing an example of a method of extracting main data samples according to an embodiment of the present invention; and

FIG. 11 is a diagram showing the configuration of an example of a data visualization device according to an embodiment of the present invention.

Hereinafter, a data visualization method and apparatus according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIGS. 1 to 3 are diagrams illustrating examples of plots used for data visualization according to an embodiment of the present invention.

Referring to FIG. 1, an example of a heat map displaying 12 variables for 4 clusters is shown. The color reflects the normalized mean value of each variable within a cluster. The heat map provides an overview of the variable values in each cluster. Heat maps can be useful for interpreting clusters and understanding differences between clusters and, to the extent possible, provide an overview of the relationships between different variables.
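The values behind such a heat map can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the mean values are normalized, so min-max scaling over all samples is assumed here, and the function name `cluster_mean_matrix` is hypothetical.

```python
def cluster_mean_matrix(samples, cluster_ids):
    """Return {cluster: [normalized mean of each variable]} for a heat map.

    Assumption: each variable is min-max scaled to [0, 1] over all samples
    before averaging; the source text does not specify the normalization.
    """
    n_vars = len(samples[0])
    columns = list(zip(*samples))
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]

    def scale(value, j):
        # A constant variable carries no contrast, so map it to 0.
        return 0.0 if hi[j] == lo[j] else (value - lo[j]) / (hi[j] - lo[j])

    matrix = {}
    for cluster in sorted(set(cluster_ids)):
        rows = [s for s, cid in zip(samples, cluster_ids) if cid == cluster]
        matrix[cluster] = [
            sum(scale(row[j], j) for row in rows) / len(rows)
            for j in range(n_vars)
        ]
    return matrix
```

Each row of the returned mapping corresponds to one cluster and each entry to one cell color of the heat map.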

Referring to FIG. 2, an example of a parallel coordinate plot is shown. Each line in the parallel coordinate plot represents a data sample. Variables are displayed on vertical axes, except for the rightmost one, and the color indicates the cluster. A parallel coordinate plot provides a detailed picture of the data samples within a cluster. Parallel coordinate plots can help evaluate the distribution of variable values within each cluster, are useful for interpreting and comparing clusters, and provide an overview of the relationships between different variables.

Referring to FIG. 3, an example of a projection chart is shown. Each point in the figure is a multivariate data sample projected onto a two-dimensional plane, and different colors represent different clusters. Projection charts can help visually assess the topological structure of data in terms of distances between data points, outliers, and density distributions.

This embodiment only shows an example of a visualization method for better understanding, but the present invention is not limited to the visualization methods of FIGS. 1 to 3 , and various conventional visualization methods may be applied to this embodiment.

FIG. 4 is a flowchart illustrating an example of a data visualization method according to an embodiment of the present invention.

Referring to FIG. 4, a data visualization device (hereinafter referred to as the 'device') clusters a plurality of data samples into a plurality of clusters (S400). A data sample is data composed of variable values for a plurality of variables, and an example thereof is shown in FIG. 5. The device may perform clustering using an unsupervised learning model, for example any of various conventional clustering algorithms (e.g., k-means), and an example thereof is shown in FIG. 6.

The device determines main variables representing a difference between the plurality of clusters among the plurality of variables constituting the data samples (S410). The number of main variables may be set in various ways according to embodiments. For example, if the number of variables in the data sample is 100, the device may set 10 as the number of main variables. Specific methods for determining the main variables are described below with reference to FIGS. 7 to 9.

The device extracts a certain number of data samples for each cluster such that the extracted data samples include at least one of the minimum, maximum, and average variable values of the main variables (S420). The number of data samples to be extracted may be set in various ways according to embodiments. A specific example of a data sample extraction method is described below with reference to FIG. 10.

When the main variables and main data samples are extracted, the device performs visualization based on them and displays them (S430). For example, the device may display main variables and main data samples using various plots shown in FIGS. 1 to 3 .

FIG. 5 is a diagram showing an example of a data sample according to an embodiment of the present invention.

Referring to FIG. 5 , a data set 500 includes a plurality of data samples 520 . Each data sample 520 includes variable values for a plurality of variables 510 . For example, the data set 500 of this embodiment includes M data samples 520, and each data sample includes variable values for n variables 510. This embodiment is only an example to aid understanding, and the shape of the dataset may be variously modified according to the embodiment.

FIG. 6 is a diagram illustrating an example of clustering according to an embodiment of the present invention.

Referring to FIG. 6, the device classifies a plurality of data samples 600 into a plurality of clusters 610, 620, and 630. For example, the device may cluster data samples using various conventional clustering algorithms (e.g., k-means). The number of clusters 610, 620, and 630 may be set by a user or automatically.
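As a rough illustration of this clustering step, the following is a minimal k-means sketch in plain Python. The text names k-means only as one possible algorithm. The deterministic farthest-first initialization and the names `kmeans` and `sq_dist` are assumptions made here for reproducibility, not part of the described method.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(samples, k, max_iter=100):
    """Assign each sample (a tuple of variable values) to one of k clusters."""
    # Assumed initialization: start from the first sample, then repeatedly
    # add the sample farthest from the centroids chosen so far.
    centroids = [samples[0]]
    while len(centroids) < k:
        centroids.append(
            max(samples, key=lambda s: min(sq_dist(s, c) for c in centroids))
        )

    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every sample.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(s, centroids[c]))
            for s in samples
        ]
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment
```

The returned list holds one cluster index per data sample, which plays the role of the cluster membership shown in FIG. 6.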

FIGS. 7 and 8 are diagrams illustrating an example of a method for determining main variables according to an embodiment of the present invention.

Referring to FIGS. 7 and 8 together, the device generates a plurality of cluster combinations, each including at least two of the plurality of clusters (S700). For example, if a plurality of data samples are clustered into three clusters as shown in FIG. 8, three different combinations, (C1, C2), (C2, C3), and (C1, C3), are generated. The number of cluster combinations depends on the number of clusters; for pairwise combinations of k clusters it is k(k-1)/2.
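Generating the pairwise cluster combinations of step S700 can be sketched with the standard library. The function name `cluster_combinations` is hypothetical, and only the two-cluster case from the FIG. 8 example is covered.

```python
from itertools import combinations


def cluster_combinations(cluster_labels):
    """Return every unordered pair of distinct cluster labels."""
    return list(combinations(sorted(set(cluster_labels)), 2))


pairs = cluster_combinations(["C1", "C2", "C3"])
print(pairs)
```

For the three clusters of FIG. 8 this yields the same three combinations as in the text, listed in sorted order.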

For each cluster combination, the device compares the distribution of variable values for each variable in the two clusters to determine the cluster difference for each variable (S710). For example, referring to FIG. 8, the device determines the cluster difference of each variable (X1 to X5) 810 for the C1&C2 cluster combination 820. That is, the distribution of the X1 variable values of the data samples belonging to the C1 cluster is compared with the distribution of the X1 variable values of the data samples belonging to the C2 cluster, and the difference is identified through a statistical method. Various statistical methods for calculating the difference in the distribution of variable values for each variable in the two clusters may be applied to this embodiment. For example, the device may calculate the p-value of a non-parametric statistical test comparing the distribution of variable values for each variable in the two clusters of each cluster combination (820, 822, 824) and use it as the cluster difference.
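The text does not name a particular non-parametric test. As one possible choice, the sketch below assumes the Mann-Whitney U test and computes a two-sided p-value with the normal approximation, without tie or continuity correction. The function names are hypothetical, and a statistics library (for example SciPy's `scipy.stats.mannwhitneyu`) would normally be preferred in practice.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_p(a, b):
    """Two-sided p-value for one variable's values in two clusters.

    Simplification: normal approximation of the U statistic, with no
    tie correction and no continuity correction.
    """
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))
```

Calling `mann_whitney_p` once per variable and per cluster combination would fill a table of cluster differences like the one in FIG. 8; a smaller p-value indicates a larger difference between the two distributions.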

The device selects a predefined number of variables as main variables based on the size of the cluster difference for each variable in a plurality of cluster combinations (S720). For example, the device may select as a main variable a variable showing a large difference in the distribution of values of each variable of the two clusters of each cluster combination, or select a variable with high importance as a main variable in a model approximating the clustering result.

For example, referring to FIG. 8, the cluster difference of each variable 810 between the two clusters of each cluster combination (820, 822, 824) is calculated and displayed as the p-value of a non-parametric statistical test. Since methods for calculating the p-value of a non-parametric statistical test are already widely known, a description thereof is omitted. In the C1&C2 cluster combination 820, the p-value obtained by comparing the distribution of the X1 variable values of the data samples belonging to the C1 cluster with the distribution of the X1 variable values of the data samples belonging to the C2 cluster is 0.131, and the p-values of variables X2, X3, X4, and X5 are 0.185, 0.021, 0.082, and 0.016, respectively.

Among the p-values that do not exceed a preset threshold value (e.g., 0.05), the device identifies a predefined number of p-values in ascending order. FIG. 8 shows a case in which the five smallest p-values 830 are selected. The number of selected p-values may be variously modified according to embodiments.

The device may select each variable corresponding to a selected p-value as a main variable. For example, variables X3 and X5 are selected in the cluster combination C1&C2 820, variable X3 is selected in the cluster combination C2&C3 822, and variables X1 and X4 are selected in the cluster combination C1&C3 824. Excluding overlapping variables, the device finally selects four variables {X1, X3, X4, X5} as main variables.
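The threshold-and-rank selection of step S720 might look like the sketch below. The C1&C2 p-values are the ones quoted in the text. The values for the other two combinations are not given in the text and are hypothetical placeholders, chosen only so that the outcome matches the FIG. 8 example. The function name and parameters are likewise assumptions.

```python
def select_main_variables(p_values, threshold=0.05, max_count=5):
    """Pick variables behind the smallest p-values that do not exceed threshold.

    p_values maps a cluster combination to {variable: p-value}.
    """
    candidates = [
        (p, variable)
        for per_variable in p_values.values()
        for variable, p in per_variable.items()
        if p <= threshold
    ]
    candidates.sort(key=lambda item: item[0])  # ascending p-value
    selected = candidates[:max_count]
    # A variable selected in several combinations is kept only once.
    return sorted({variable for _, variable in selected})


p_table = {
    # Values quoted in the text for combination C1&C2.
    "C1&C2": {"X1": 0.131, "X2": 0.185, "X3": 0.021, "X4": 0.082, "X5": 0.016},
    # Hypothetical placeholder values (not from the source).
    "C2&C3": {"X1": 0.400, "X2": 0.220, "X3": 0.009, "X4": 0.310, "X5": 0.270},
    "C1&C3": {"X1": 0.012, "X2": 0.350, "X3": 0.074, "X4": 0.033, "X5": 0.190},
}
main_variables = select_main_variables(p_table)
print(main_variables)
```

With these inputs five p-values fall under the threshold, X3 appears twice among them, and four distinct variables remain after duplicates are removed.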

FIG. 9 is a diagram showing an example of another method of extracting a main variable according to an embodiment of the present invention.

Referring to FIG. 9, the device trains a tree-based classification model (e.g., a decision tree or an ensemble learning model) using the labels of each cluster (S900). For example, when N clusters are created as shown in the example of FIG. 6, the data samples belonging to each cluster are labeled with a value that distinguishes that cluster. That is, a value representing the C1 cluster (e.g., a first label) is assigned to data samples belonging to the C1 cluster, and a value representing the C2 cluster (e.g., a second label) is assigned to data samples belonging to the C2 cluster. The device may train the tree-based classification model using the labels assigned to each data sample.

The device calculates the importance of each variable from the trained tree-based classification model (S910). Since the method itself for calculating the importance of each variable in the tree-based classification model is already a well-known technique, description thereof will be omitted. When the importance of variables is calculated, the device selects a predetermined number of variables as main variables in order of importance (S920).
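To make steps S900 to S920 concrete, the following is a small pure-Python sketch of one well-known importance measure, the mean decrease in Gini impurity of a single decision tree. The text does not prescribe this measure or this tree-building procedure; all function names are hypothetical, and a library model (for example scikit-learn's `DecisionTreeClassifier` with its `feature_importances_` attribute) would be the usual choice.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of cluster labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def best_split(rows, labels):
    """Return (impurity decrease, variable index, threshold) or None."""
    n, parent, best = len(rows), gini(labels), None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best


def grow(rows, labels, importance, total, depth, max_depth):
    """Recursively split and credit each split's gain to its variable."""
    if depth >= max_depth or len(set(labels)) < 2:
        return
    split = best_split(rows, labels)
    if split is None or split[0] <= 0:
        return
    gain, j, t = split
    importance[j] += gain * len(rows) / total  # weight by node size
    for keep_left in (True, False):
        part = [(r, y) for r, y in zip(rows, labels) if (r[j] <= t) == keep_left]
        grow([r for r, _ in part], [y for _, y in part],
             importance, total, depth + 1, max_depth)


def variable_importance(rows, labels, max_depth=3):
    """Normalized importance of each variable for predicting the cluster label."""
    importance = [0.0] * len(rows[0])
    grow(rows, labels, importance, len(rows), 0, max_depth)
    total = sum(importance)
    return [v / total for v in importance] if total else importance


def top_variables(importance, count):
    """Indices of the `count` most important variables (S920)."""
    return sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)[:count]
```

Here `rows` holds the variable values of each data sample and `labels` holds the cluster label assigned to it, so a variable that separates the clusters well receives a high importance.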

FIG. 10 is a diagram illustrating an example of a method of extracting main data samples according to an embodiment of the present invention.

Referring to FIG. 10, the device extracts, for each cluster, data samples having variable values corresponding to the minimum, maximum, and/or average of the main variables (S1000). For example, in the example of FIG. 8, the main variables are {X1, X3, X4, X5}. In this case, among the data samples belonging to the C1 cluster, the device extracts the data samples having the minimum value, the maximum value, or the average value (or the variable value closest to the average) of variable X1, and extracts data samples in the same way for main variables X3, X4, and X5. Data samples are extracted for the C2 and C3 clusters in the same way. This embodiment describes an example of extracting data samples having the minimum, maximum, and average values of each variable, but is not necessarily limited thereto, and may be modified to extract data samples having variable values corresponding to various other statistically significant values.
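Step S1000 can be illustrated with the sketch below. It assumes that each data sample is a mapping from variable name to value and that the "average" sample is the one whose value lies closest to the cluster mean. The function name `extreme_sample_indices` is hypothetical.

```python
from statistics import mean


def extreme_sample_indices(samples, cluster_ids, main_variables):
    """For each cluster, collect indices of samples holding the minimum,
    maximum, and closest-to-mean value of every main variable."""
    picked = {}
    for cluster in sorted(set(cluster_ids)):
        members = [i for i, cid in enumerate(cluster_ids) if cid == cluster]
        chosen = set()
        for variable in main_variables:
            average = mean(samples[i][variable] for i in members)
            chosen.add(min(members, key=lambda i: samples[i][variable]))
            chosen.add(max(members, key=lambda i: samples[i][variable]))
            chosen.add(min(members, key=lambda i: abs(samples[i][variable] - average)))
        picked[cluster] = chosen
    return picked
```

Because the result is a set per cluster, a sample that is extreme for several main variables is stored only once.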

The device also extracts a certain number of data samples (e.g., 500) at random (i.e., with uniform selection probability) for each cluster (S1010). The number of data samples to be extracted for each cluster may be set in various ways according to embodiments.

The device merges the first data sample group extracted based on the main variables (S1000) with the second data sample group extracted at random (S1010) and excludes overlapping data samples (S1020). In this way, the device extracts data samples for each cluster. That is, in the case of FIG. 8, main data samples are extracted for each of the clusters C1, C2, and C3.
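Steps S1010 and S1020 can be sketched as follows for a single cluster. Samples are represented by their indices; the function name, the `seed` parameter, and the capping of the random draw at the cluster size are assumptions made for this illustration.

```python
import random


def merge_sample_groups(first_group, cluster_members, n_random, seed=0):
    """Combine variable-based samples with uniformly drawn samples, dropping
    duplicates.

    first_group: indices extracted from the main variables (S1000).
    cluster_members: indices of all samples in the cluster.
    """
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    count = min(n_random, len(cluster_members))
    # random.sample draws without replacement with uniform probability.
    second_group = rng.sample(list(cluster_members), count)
    # dict.fromkeys keeps first occurrences in order and removes duplicates.
    return list(dict.fromkeys(list(first_group) + second_group))
```

The merged list always contains every sample of the first group, so the extreme samples survive even when the random draw misses them.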

Referring to FIG. 11 , the data visualization device 1100 includes a clustering unit 1110, a variable selection unit 1120, a sample selection unit 1130, and a visualization unit 1140. The data visualization device 1100 may be implemented as a computing device including a memory, processor, input/output device, and the like. In this case, each component may be implemented as software, loaded into a memory, and then driven by a processor.

The clustering unit 1110 clusters a plurality of data samples composed of variable values for a plurality of variables into a plurality of clusters. An example of clustering is shown in FIG. 6.

The variable selection unit 1120 determines a main variable representing a difference between the plurality of clusters among the plurality of variables. The variable selection unit 1120 may determine, as a main variable, a variable whose distribution differs greatly between clusters. An example of this is shown in FIGS. 7 and 8. Alternatively, the variable selection unit 1120 may determine, as a main variable, a variable having high importance in a model that approximates the clustering result. An example of this is shown in FIG. 9.

The sample selection unit 1130 extracts a certain number of data samples for each cluster such that data samples including minimum, maximum, or average variable values for the main variables are included. An example of the operation of the sample selection unit is shown in FIG. 10.

The visualization unit 1140 visualizes and displays the main variables and the extracted main data samples. For example, the visualization unit may visualize main variables and main data samples using the plots of FIGS. 1 to 3 .
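One way to picture the four units as software components is the sketch below, in which each unit is a pluggable callable. The class name, field names, and call signatures are assumptions for illustration only and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class DataVisualizationDevice:
    """Chains the four units of FIG. 11; each unit is supplied as a callable."""
    clustering_unit: Callable[[Sequence[Any]], Sequence[Any]]
    variable_selection_unit: Callable[[Sequence[Any], Sequence[Any]], Sequence[Any]]
    sample_selection_unit: Callable[[Sequence[Any], Sequence[Any], Sequence[Any]], Any]
    visualization_unit: Callable[[Sequence[Any], Any], Any]

    def run(self, samples):
        cluster_ids = self.clustering_unit(samples)                        # S400
        main_vars = self.variable_selection_unit(samples, cluster_ids)     # S410
        main_samples = self.sample_selection_unit(samples, cluster_ids, main_vars)  # S420
        return self.visualization_unit(main_vars, main_samples)            # S430


# Usage with trivial stand-in units (placeholders, not real algorithms).
device = DataVisualizationDevice(
    clustering_unit=lambda samples: [0 if s[0] < 5 else 1 for s in samples],
    variable_selection_unit=lambda samples, ids: [0],
    sample_selection_unit=lambda samples, ids, variables: samples[:2],
    visualization_unit=lambda variables, chosen: {"variables": variables, "samples": chosen},
)
result = device.run([(1, 2), (9, 3), (2, 8)])
```

Keeping the units as separate callables mirrors the description that each component may be implemented as software and swapped between the alternatives of FIGS. 7 to 9.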

Each embodiment of the present invention can also be implemented as computer readable codes on a computer readable recording medium. A computer-readable recording medium includes all types of recording devices in which data that can be read by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, SSD, and optical data storage devices. In addition, the computer-readable recording medium may be distributed to computer systems connected through a network to store and execute computer-readable codes in a distributed manner.

So far, the present invention has been looked at mainly with its preferred embodiments. Those skilled in the art to which the present invention pertains will be able to understand that the present invention can be implemented in a modified form without departing from the essential characteristics of the present invention. Therefore, the disclosed embodiments should be considered from a descriptive point of view rather than a limiting point of view. The scope of the present invention is shown in the claims rather than the foregoing description, and all differences within the equivalent scope will be construed as being included in the present invention.

Claims

1. A data visualization method performed by a data visualization device, the method comprising:

clustering a plurality of data samples composed of variable values for a plurality of variables into a plurality of clusters;

identifying a main variable representing a difference between the plurality of clusters among the plurality of variables;

extracting a certain number of data samples for each cluster such that data samples including minimum, maximum, or average variable values for the main variable are included; and

visualizing based on the main variable and the extracted data samples.

2. The method of claim 1, wherein the identifying of the main variable comprises:

generating a plurality of cluster combinations each including two clusters among the plurality of clusters;

comparing, for each cluster combination, distributions of variable values for each variable in the two clusters to identify a cluster difference for each variable; and

selecting a predefined number of variables as main variables based on the size of the cluster difference for each variable in the plurality of cluster combinations.

3. The method of claim 2, wherein:

the identifying of the cluster difference comprises calculating a p-value of a non-parametric statistical test comparing the distributions of variable values for each variable in the two clusters; and

the selecting as the main variables comprises selecting a certain number of variables as main variables in ascending order of p-value among p-values that do not exceed a predefined threshold value.

4. The method of claim 1, wherein the extracting comprises:

extracting, for each cluster, a first data sample group including the minimum, maximum, or average variable values;

extracting, for each cluster, a second data sample group of a predefined number of data samples with uniform selection probability; and

discarding, for each cluster, redundant data samples among the first data sample group and the second data sample group.

5. The method of claim 1, wherein the visualizing comprises:

displaying the main variable and the extracted data samples using a heat map or a parallel coordinate diagram.

6. The method of claim 1, wherein the identifying of the main variable comprises:

training a tree-based classification model based on the label of each cluster;

calculating an importance of each variable from the trained tree-based classification model; and

selecting a predetermined number of variables as main variables in order of importance.

7. A data visualization device comprising:

a clustering unit that clusters a plurality of data samples composed of variable values for a plurality of variables into a plurality of clusters;

a variable selection unit that determines a main variable representing a difference between the plurality of clusters among the plurality of variables;

a sample selection unit that extracts, for each cluster, a certain number of data samples such that data samples including minimum, maximum, or average variable values for the main variable are included; and

a visualization unit that visualizes based on the main variable and the extracted data samples.

8. The data visualization device of claim 7, wherein the variable selection unit

generates a plurality of cluster combinations each including two clusters among the plurality of clusters, compares, for each cluster combination, the distributions of variable values for each variable in the two clusters to identify a cluster difference for each variable, and selects a predefined number of variables as main variables based on the cluster difference for each variable.

9. The data visualization device of claim 7, wherein the variable selection unit

calculates the importance of each variable from a tree-based classification model trained based on the label of each cluster and selects a certain number of variables as main variables in order of importance.

10. A computer-readable recording medium on which a computer program for performing the method according to claim 1 is recorded.