WO2023063485A1 - Data visualization method and corresponding device

Data visualization method and corresponding device

Info

Publication number
WO2023063485A1
Authority
WO
WIPO (PCT)
Prior art keywords
variable
variables
cluster
data
clusters
Prior art date
Application number
PCT/KR2021/017808
Other languages
English (en)
Korean (ko)
Inventor
최유리
피에 로말리자장
Original Assignee
주식회사 솔리드웨어
Priority date
2021-10-14
Filing date
2021-11-30
Publication date
2023-04-20
Application filed by 주식회사 솔리드웨어
Publication of WO2023063485A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/904 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/287 Visualization; Browsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • Embodiments of the present invention relate to a method and apparatus for visualizing data, and more particularly, to a method and apparatus for visualizing a result of data clustering.
  • Unsupervised learning is often used to understand and model unlabeled data. For example, customer information in a marketing database, the results of a large-scale public survey, or the test results of a large chemical compound library may be classified into a plurality of clusters using an unsupervised learning model. However, most such data contains no information that can guide the training of the unsupervised learning model.
  • A common way to provide additional information is through visualizations such as diagrams and pictures.
  • Appropriate visualization can help (1) understand the data structure, (2) interpret clustering results, (3) compare clusters, and (4) detect outliers in the data.
  • Visualization methods for multidimensional data include line plots, scatter plots, projections, Chernoff faces, star and radar plots, correlation plots, matrix plots, parallel coordinates, and heat maps.
  • However, when there is no target variable, these visualization methods have no target to focus on, which increases the uncertainty about which information is important enough to visualize.
  • Visualizing high-dimensional data, i.e., data with hundreds of variables, is often counterproductive and prevents a good understanding of the results.
  • In addition, many visualization methods limit the number of variables that can be displayed simultaneously. Representing complex interactions between variables requires representing the relationships between them, but finding such interactions is computationally difficult because the number of possible interactions is significantly greater than the number of variables.
  • When users manually select variables and visualization methods, it is difficult to make accurate decisions, and errors are likely when the data is not well understood.
  • A technical problem to be solved by an embodiment of the present invention is to provide a method and apparatus that visualize data efficiently by extracting main variables and main samples so that users can better understand the results of data clustering.
  • An example of a data visualization method according to an embodiment of the present invention for achieving the above technical problem is performed by a data visualization device and includes: clustering a plurality of data samples, each composed of variable values for a plurality of variables, into a plurality of clusters; identifying, among the plurality of variables, a main variable representing a difference between the plurality of clusters; extracting a certain number of data samples for each cluster such that data samples including the minimum, maximum, or average variable value for the main variable are included; and visualizing based on the main variable and the extracted data samples.
  • An example of a data visualization device for achieving the above technical problem includes: a clustering unit that clusters a plurality of data samples, each composed of variable values for a plurality of variables, into a plurality of clusters; a variable selection unit that determines, among the plurality of variables, a main variable representing a difference between the plurality of clusters; a sample selection unit that extracts, for each cluster, a certain number of data samples such that data samples including the minimum, maximum, or average variable value for the main variable are included; and a visualization unit that visualizes based on the main variable and the extracted data samples.
  • FIGS. 1 to 3 are diagrams showing examples of plots used for data visualization according to an embodiment of the present invention.
  • FIG. 4 is a flowchart illustrating an example of a data visualization method according to an embodiment of the present invention.
  • FIG. 5 is a diagram showing an example of a data sample according to an embodiment of the present invention.
  • FIG. 6 is a diagram showing an example of clustering according to an embodiment of the present invention.
  • FIGS. 7 and 8 are diagrams showing an example of a method for determining main variables according to an embodiment of the present invention.
  • FIG. 9 is a diagram showing an example of another method of extracting a main variable according to an embodiment of the present invention.
  • FIG. 10 is a diagram showing an example of a method of extracting a main data sample according to an embodiment of the present invention.
  • FIG. 11 is a diagram showing the configuration of an example of a data visualization device according to an embodiment of the present invention.
  • FIGS. 1 to 3 are diagrams illustrating an example of a plot used for data visualization according to an embodiment of the present invention.
  • Referring to FIG. 1, an example of a heat map displaying 12 variables for 4 clusters is shown.
  • The color of each cell reflects the normalized mean value of the variable within the cluster.
  • The heat map provides an overview of the variable values in each cluster. Heat maps are useful for interpreting clusters and understanding the differences between them and, to the extent possible, provide an overview of the relationships between different variables.
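  • As a minimal sketch of the quantity such a heat map would color, the following computes the per-cluster mean of each variable and rescales it. The source does not state which normalization is used; min-max scaling of each variable across clusters is an assumption, and the function name `cluster_mean_matrix` and the sample data are hypothetical.

```python
from collections import defaultdict

def cluster_mean_matrix(samples, labels):
    """Return {cluster: [normalized mean of each variable]} for a heat map."""
    n_vars = len(samples[0])
    sums = defaultdict(lambda: [0.0] * n_vars)
    counts = defaultdict(int)
    for row, label in zip(samples, labels):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    means = {c: [s / counts[c] for s in sums[c]] for c in sums}

    # Assumed normalization: min-max scale each variable across the clusters.
    normalized = {c: [0.0] * n_vars for c in means}
    for j in range(n_vars):
        column = [means[c][j] for c in means]
        lo, hi = min(column), max(column)
        span = hi - lo
        for c in means:
            normalized[c][j] = (means[c][j] - lo) / span if span else 0.0
    return normalized

# Hypothetical data: four samples, two variables, two clusters.
samples = [[1.0, 10.0], [3.0, 10.0], [5.0, 30.0], [7.0, 50.0]]
labels = ["C1", "C1", "C2", "C2"]
heat = cluster_mean_matrix(samples, labels)
```

  Each row of `heat` corresponds to one cluster and each column to one variable, so the result can be passed to any plotting routine that draws a colored grid.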
  • Referring to FIG. 2, an example of a parallel coordinate plot is shown.
  • Each line in the parallel coordinate plot represents a data sample. Each variable, except the rightmost one, is displayed on its own vertical axis, and the line color indicates the cluster.
  • A parallel coordinate plot provides a detailed picture of the data samples within a cluster. Parallel coordinate plots help evaluate the distribution of variable values within each cluster, are useful for interpreting and comparing clusters, and provide an overview of the relationships between different variables.
  • Referring to FIG. 3, a projection chart is shown. Each point in the figure is a multivariate data sample projected onto a two-dimensional plane, and different colors represent different clusters. Projection charts help visually assess the topological structure of the data in terms of distances between data points, outliers, and density distributions.
  • This embodiment shows only examples of visualization methods to aid understanding. The present invention is not limited to the visualization methods of FIGS. 1 to 3, and various conventional visualization methods may be applied to this embodiment.
  • FIG. 4 is a flowchart illustrating an example of a data visualization method according to an embodiment of the present invention.
  • Referring to FIG. 4, a data visualization device (hereinafter, "device") clusters a plurality of data samples into a plurality of clusters (S400).
  • A data sample is data composed of variable values for a plurality of variables, and an example is shown in FIG. 5.
  • The device may perform clustering using an unsupervised learning model, for example any of various conventional clustering algorithms (e.g., k-means), and an example is shown in FIG. 6.
  • The device determines, among the plurality of variables constituting the data samples, main variables representing a difference between the plurality of clusters (S410).
  • The number of main variables may be set in various ways according to embodiments. For example, if the data samples have 100 variables, the device may set the number of main variables to 10. Specific methods for determining the main variables are described with reference to FIGS. 7 to 9.
  • The device extracts a certain number of data samples for each cluster such that the extracted data samples include a sample having at least one of the minimum, maximum, and average values of the main variables (S420).
  • The number of data samples to be extracted may be set in various ways according to embodiments. A specific example of a data sample extraction method is described with reference to FIG. 10.
  • When the main variables and main data samples have been extracted, the device performs visualization based on them and displays the result (S430). For example, the device may display the main variables and main data samples using the various plots shown in FIGS. 1 to 3.
  • FIG. 5 is a diagram showing an example of a data sample according to an embodiment of the present invention.
  • Referring to FIG. 5, a data set 500 includes a plurality of data samples 520.
  • Each data sample 520 includes variable values for a plurality of variables 510.
  • The data set 500 of this embodiment includes M data samples 520, and each data sample includes variable values for n variables 510.
  • This embodiment is only an example to aid understanding, and the form of the data set may be modified in various ways according to embodiments.
  • FIG. 6 is a diagram illustrating an example of clustering according to an embodiment of the present invention.
  • Referring to FIG. 6, the device classifies a plurality of data samples 600 into a plurality of clusters 610, 620, and 630.
  • The device may cluster the data samples using various conventional clustering algorithms (e.g., k-means).
  • The number of clusters 610, 620, and 630 may be set by a user or determined automatically.
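  • The clustering step can be illustrated with a plain k-means (Lloyd's algorithm) sketch. The source names k-means only as one possible algorithm; the initialization from the first k samples, the function name `kmeans`, and the sample data are assumptions made here for a deterministic example.

```python
import math

def kmeans(samples, k, iterations=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Assumed deterministic start: the first k samples are the initial centroids.
    centroids = [list(s) for s in samples[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(s, centroids[c]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, l in zip(samples, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical data: two well-separated groups in two variables.
samples = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.1], [0.9, 1.1], [9.2, 8.9]]
labels, centroids = kmeans(samples, k=2)
```

  The returned `labels` play the role of the cluster assignment that the later steps (main variable selection and sample extraction) take as input.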
  • FIGS. 7 and 8 are diagrams illustrating an example of a method for determining main variables according to an embodiment of the present invention.
  • The device generates a plurality of cluster combinations, each including at least two of the plurality of clusters (S700). For example, if the data samples are clustered into three clusters as shown in FIG. 8, three different combinations are generated: (C1, C2), (C2, C3), and (C1, C3). The number of cluster combinations depends on the number of clusters.
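  • A short sketch of step S700, assuming the combinations are pairs as in the three-cluster example: `itertools.combinations` enumerates every unordered pair, and the count is k(k-1)/2 for k clusters. The helper name `cluster_pairs` is hypothetical.

```python
from itertools import combinations
from math import comb

def cluster_pairs(clusters):
    """Enumerate every unordered pair of clusters."""
    return list(combinations(clusters, 2))

pairs = cluster_pairs(["C1", "C2", "C3"])
# The number of pairs grows quadratically with the number of clusters.
pair_count_for_ten = comb(10, 2)
```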
  • For each cluster combination, the device compares the distributions of the variable values of each variable in the two clusters to determine the cluster difference for each variable (S710). For example, referring to FIG. 8, the device determines the cluster difference of each variable (X1 to X5) 810 for the C1&C2 cluster combination 820. That is, the distribution of the X1 values of the data samples belonging to cluster C1 is compared with the distribution of the X1 values of the data samples belonging to cluster C2, and the difference is quantified using a statistical method.
  • Various statistical methods for calculating the difference between the distributions of variable values in two clusters may be applied to this embodiment. For example, for each cluster combination (820, 822, 824), the device may calculate the p-value of a non-parametric statistical test comparing the distributions of each variable in the two clusters and use it as the cluster difference.
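  • The source does not name a specific non-parametric test. As one possible choice, the sketch below uses the Mann-Whitney U test with the normal approximation (tie-corrected, without continuity correction), which is an assumption made here for illustration; the function name `mann_whitney_p` and the sample values are hypothetical.

```python
import math

def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    n1, n2 = len(a), len(b)
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t  # contribution of this tie group
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# Hypothetical X1 values of the samples in two clusters.
x1_in_c1 = [float(v) for v in range(1, 11)]
x1_in_c2 = [float(v) for v in range(11, 21)]
p_different = mann_whitney_p(x1_in_c1, x1_in_c2)
p_same = mann_whitney_p([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```

  The normal approximation is reasonable for clusters with many samples; for very small clusters an exact test would be more appropriate.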
  • The device selects a predefined number of variables as main variables based on the size of the cluster difference of each variable across the plurality of cluster combinations (S720). For example, the device may select as a main variable a variable whose value distributions differ greatly between the two clusters of a cluster combination, or a variable that has high importance in a model approximating the clustering result.
  • Referring to FIG. 8, the cluster difference of each variable 810 between the two clusters of each cluster combination (820, 822, 824) is calculated and displayed as the p-value of a non-parametric statistical test. Since methods for calculating the p-value of a non-parametric statistical test are already widely known, their description is omitted.
  • For example, the p-value obtained by comparing the distribution of the X1 values of the data samples belonging to cluster C1 with the distribution of the X1 values of the data samples belonging to cluster C2 is 0.131, and the p-values of the variables X2, X3, X4, and X5 are 0.185, 0.021, 0.082, and 0.016, respectively.
  • Among the p-values that do not exceed a preset threshold (e.g., 0.05), the device identifies a predefined number of p-values in ascending order, starting from the smallest. FIG. 8 shows a case in which five p-values 830 are selected in this way. The number of selected p-values may be modified in various ways according to embodiments.
  • The device may select the variable corresponding to each selected p-value as a main variable. For example, variables X3 and X5 are selected in the cluster combination C1&C2 820, variable X3 is selected in the cluster combination C2&C3 822, and variables X1 and X4 are selected in the cluster combination C1&C3 824.
  • After excluding duplicate variables, the device finally selects the four variables {X1, X3, X4, X5} as main variables.
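  • The selection rule can be sketched as follows. The p-values for C1&C2 are those given in the text; the values for C2&C3 and C1&C3 are not stated in the source and are hypothetical, chosen only so that the same variables fall below the threshold as in the example. The function name `select_main_variables` is also hypothetical.

```python
def select_main_variables(p_values, threshold=0.05, top_k=5):
    """p_values maps (cluster pair, variable) to a p-value."""
    # Keep p-values that do not exceed the threshold, smallest first.
    significant = sorted(
        (p, pair, var) for (pair, var), p in p_values.items() if p <= threshold
    )
    chosen = significant[:top_k]
    # A variable selected in several cluster pairs is counted once.
    return sorted({var for _, _, var in chosen})

p_values = {
    # From the text (C1&C2).
    ("C1&C2", "X1"): 0.131, ("C1&C2", "X2"): 0.185, ("C1&C2", "X3"): 0.021,
    ("C1&C2", "X4"): 0.082, ("C1&C2", "X5"): 0.016,
    # Hypothetical values (C2&C3 and C1&C3), not taken from the source.
    ("C2&C3", "X1"): 0.410, ("C2&C3", "X2"): 0.270, ("C2&C3", "X3"): 0.008,
    ("C2&C3", "X4"): 0.330, ("C2&C3", "X5"): 0.120,
    ("C1&C3", "X1"): 0.012, ("C1&C3", "X2"): 0.640, ("C1&C3", "X3"): 0.090,
    ("C1&C3", "X4"): 0.034, ("C1&C3", "X5"): 0.210,
}
main_variables = select_main_variables(p_values)
```

  Five p-values pass the threshold, but X3 appears twice, so four distinct main variables remain.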
  • FIG. 9 is a diagram showing an example of another method of extracting a main variable according to an embodiment of the present invention.
  • Referring to FIG. 9, the device trains a tree-based classification model (e.g., a decision tree or an ensemble model) using the label of each cluster (S900). For example, when N clusters are created as in the example of FIG. 6, the data samples belonging to each cluster are labeled with a value that distinguishes that cluster. That is, a value representing cluster C1 (e.g., a first label) is assigned to the data samples belonging to cluster C1, and a value representing cluster C2 (e.g., a second label) is assigned to the data samples belonging to cluster C2. The device may train the tree-based classification model using the labels assigned to the data samples.
  • The device calculates the importance of each variable from the trained tree-based classification model (S910). Since methods for calculating variable importance in a tree-based classification model are already well known, their description is omitted.
  • The device selects a predetermined number of variables as main variables in descending order of importance (S920).
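  • The source leaves the tree model and the importance measure open. The sketch below is one hedged illustration: a small hand-written CART-style tree that accumulates the weighted Gini impurity decrease per variable. The Gini criterion, the depth limit, and all function names and data are assumptions, not the method of the source.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (impurity decrease, variable index, threshold) of the best split."""
    parent = gini(labels)
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [l for r, l in zip(rows, labels) if r[j] <= t]
            right = [l for r, l in zip(rows, labels) if r[j] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if parent - child > best[0] + 1e-12:
                best = (parent - child, j, t)
    return best

def tree_importances(rows, labels, max_depth=3, total=None, scores=None):
    """Grow a small tree and sum the weighted Gini decrease per variable."""
    total = total or len(rows)
    scores = scores if scores is not None else [0.0] * len(rows[0])
    gain, j, t = best_split(rows, labels)
    if max_depth == 0 or j is None:
        return scores  # depth limit reached or node cannot be improved
    scores[j] += gain * len(rows) / total
    for keep in (lambda v: v <= t, lambda v: v > t):
        idx = [i for i, r in enumerate(rows) if keep(r[j])]
        tree_importances([rows[i] for i in idx], [labels[i] for i in idx],
                         max_depth - 1, total, scores)
    return scores

# Hypothetical data: X1 separates the clusters, X2 does not.
names = ["X1", "X2"]
rows = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [5.0, 4.5], [5.2, 3.5], [4.8, 5.5]]
cluster_labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
importances = tree_importances(rows, cluster_labels)
ranked = sorted(range(len(names)), key=lambda j: -importances[j])
main_variables = [names[j] for j in ranked[:1]]
```

  In practice an existing decision tree or ensemble implementation would normally replace the hand-written tree; the point here is only that variables used in high-gain splits receive high importance.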
  • FIG. 10 is a diagram illustrating an example of a method of extracting a main data sample according to an embodiment of the present invention.
  • Referring to FIG. 10, the device extracts, for each cluster, the data samples whose variable values correspond to the minimum, maximum, and/or average of the main variables (S1000).
  • For example, assume that the main variables are {X1, X3, X4, X5}.
  • Among the data samples belonging to cluster C1, the device extracts the data samples having the minimum value, the maximum value, or the average value (or the value closest to the average) of variable X1, and extracts data samples for the main variables X3, X4, and X5 in the same way. Data samples are extracted for clusters C2 and C3 in the same way.
  • This embodiment describes extracting the data samples having the minimum, maximum, and average values of each variable, but it is not limited thereto and may be modified to extract data samples having other statistically meaningful variable values.
  • The device also extracts a certain number of data samples (e.g., 500) at random (i.e., with uniform selection probability) for each cluster (S1010).
  • The number of data samples to be extracted for each cluster may be set in various ways according to embodiments.
  • The device combines the first data sample group extracted based on the main variables (S1000) with the second data sample group extracted at random (S1010) and excludes duplicate data samples (S1020). In this way, the device extracts main data samples for each cluster. That is, in the case of FIG. 8, main data samples are extracted for each of the clusters C1, C2, and C3.
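  • Steps S1000 to S1020 can be sketched for a single cluster as follows. The representation of a sample as an `(id, values)` pair, the function name `extract_samples`, the fixed random seed, and the data are assumptions made for illustration.

```python
import random

def extract_samples(cluster_rows, main_vars, n_random, seed=0):
    """cluster_rows: list of (sample_id, values) for one cluster.

    Returns the sorted ids of the extracted main data samples.
    """
    picked = set()
    # S1000: min, max, and closest-to-average sample for each main variable.
    for j in main_vars:
        column = [values[j] for _, values in cluster_rows]
        mean = sum(column) / len(column)
        picked.add(min(cluster_rows, key=lambda r: r[1][j])[0])
        picked.add(max(cluster_rows, key=lambda r: r[1][j])[0])
        picked.add(min(cluster_rows, key=lambda r: abs(r[1][j] - mean))[0])
    # S1010: uniform random sample of the cluster.
    rng = random.Random(seed)
    ids = [sample_id for sample_id, _ in cluster_rows]
    random_ids = rng.sample(ids, min(n_random, len(ids)))
    # S1020: the set union drops samples chosen by both steps.
    picked.update(random_ids)
    return sorted(picked)

# Hypothetical cluster C1 with five samples and two variables.
c1_rows = [(0, [1.0, 10.0]), (1, [2.0, 40.0]), (2, [3.0, 20.0]),
           (3, [4.0, 30.0]), (4, [9.0, 25.0])]
c1_main_samples = extract_samples(c1_rows, main_vars=[0], n_random=2)
```

  Calling the function once per cluster yields the per-cluster main data samples that are passed to the visualization step.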
  • FIG. 11 is a diagram showing the configuration of an example of a data visualization device according to an embodiment of the present invention.
  • Referring to FIG. 11, the data visualization device 1100 includes a clustering unit 1110, a variable selection unit 1120, a sample selection unit 1130, and a visualization unit 1140.
  • The data visualization device 1100 may be implemented as a computing device including a memory, a processor, an input/output device, and the like. In this case, each component may be implemented as software, loaded into the memory, and then executed by the processor.
  • The clustering unit 1110 clusters a plurality of data samples, each composed of variable values for a plurality of variables, into a plurality of clusters. An example of clustering is shown in FIG. 6.
  • The variable selection unit 1120 determines, among the plurality of variables, main variables representing a difference between the plurality of clusters.
  • In one embodiment, the variable selection unit 1120 may determine as a main variable a variable whose distribution differs greatly between clusters. An example of this is shown in FIGS. 7 and 8.
  • In another embodiment, the variable selection unit 1120 may determine as a main variable a variable that has high importance in a model approximating the clustering result. An example of this is shown in FIG. 9.
  • The sample selection unit 1130 extracts a certain number of data samples for each cluster such that data samples including the minimum, maximum, or average variable values for the main variables are included.
  • An example of the operation of the sample selection unit is shown in FIG. 10.
  • The visualization unit 1140 visualizes and displays the main variables and the extracted main data samples.
  • For example, the visualization unit may visualize the main variables and main data samples using the plots of FIGS. 1 to 3.
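  • As a rough sketch of how the four units could be composed in software, the following wires them together as plain callables. The class name, the callable signatures, and the trivial stand-in units are all hypothetical; they only show the order in which the results flow from one unit to the next.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataVisualizationDevice:
    clustering_unit: Callable          # samples -> cluster labels
    variable_selection_unit: Callable  # samples, labels -> main variable indices
    sample_selection_unit: Callable    # samples, labels, main_vars -> sample ids
    visualization_unit: Callable       # samples, main_vars, ids -> plot data

    def run(self, samples):
        labels = self.clustering_unit(samples)
        main_vars = self.variable_selection_unit(samples, labels)
        main_ids = self.sample_selection_unit(samples, labels, main_vars)
        return self.visualization_unit(samples, main_vars, main_ids)

# Trivial stand-in units, used only to show the data flow.
device = DataVisualizationDevice(
    clustering_unit=lambda s: [0 if row[0] < 5 else 1 for row in s],
    variable_selection_unit=lambda s, labels: [0],
    sample_selection_unit=lambda s, labels, mv: [0, len(s) - 1],
    visualization_unit=lambda s, mv, ids: {
        "variables": mv,
        "rows": [[s[i][j] for j in mv] for i in ids],
    },
)
result = device.run([[1.0, 7.0], [2.0, 8.0], [9.0, 3.0]])
```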
  • Each embodiment of the present invention can also be implemented as computer-readable code on a computer-readable recording medium.
  • A computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, SSD, and optical data storage devices.
  • The computer-readable recording medium may also be distributed over computer systems connected through a network so that the computer-readable code is stored and executed in a distributed manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a data visualization method and a corresponding device. The data visualization device clusters a plurality of data samples having variable values for a plurality of variables into a plurality of clusters; identifies, among the plurality of variables, main variables indicating the difference between the plurality of clusters; extracts a certain number of data samples for each cluster such that data samples including minimum, maximum, or average variable values for the main variables are included; and performs visualization based on the main variables and the extracted data samples.
PCT/KR2021/017808 2021-10-14 2021-11-30 Data visualization method and corresponding device WO2023063485A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2021-0136830 2021-10-14
KR1020210136830A KR20230053384A (ko) 2021-10-14 2021-10-14 Data visualization method and apparatus therefor

Publications (1)

Publication Number Publication Date
WO2023063485A1 (fr) 2023-04-20

Family

ID=85988715

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/017808 WO2023063485A1 (fr) 2021-10-14 2021-11-30 Data visualization method and corresponding device

Country Status (2)

Country Link
KR (1) KR20230053384A (fr)
WO (1) WO2023063485A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170079159A (ko) * 2015-12-30 2017-07-10 주식회사 솔리드웨어 Target information prediction system and prediction method using big data and machine learning
US20170364590A1 (en) * 2016-06-20 2017-12-21 Dell Software, Inc. Detecting Important Variables and Their Interactions in Big Data
US10176435B1 (en) * 2015-08-01 2019-01-08 Shyam Sundar Sarkar Method and apparatus for combining techniques of calculus, statistics and data normalization in machine learning for analyzing large volumes of data
KR101976689B1 (ko) * 2018-11-29 2019-05-09 주식회사 솔리드웨어 Method and apparatus for automatically generating variables for data modeling
KR102039154B1 (ko) * 2019-04-30 2019-10-31 서울시립대학교 산학협력단 Apparatus and method for visualizing data

Also Published As

Publication number Publication date
KR20230053384A (ko) 2023-04-21

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21960750; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21960750; Country of ref document: EP; Kind code of ref document: A1)