KR102199506B1

KR102199506B1 - Method and device for educating and comparing clustering algorithms, recording medium for performing the method

Info

Publication number: KR102199506B1
Application number: KR1020180134548A
Authority: KR
Inventors: 정순영; 김민석; 최우성
Original assignee: 고려대학교 산학협력단
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2021-01-06
Also published as: KR20200054364A

Abstract

Disclosed is an apparatus for comparative learning of clustering algorithms, comprising: a selection unit that selects one or two clustering algorithms and specifies the parameters each selected algorithm needs for analysis; a data generation unit that selects a data distribution, automatically generates data following the selected distribution, and adds the data to a data set; a cluster unit that analyzes the data set according to the algorithm and parameters specified by the selection unit to generate clusters and assigns a color to each generated cluster; a visualization unit that uses the colors assigned by the cluster unit to show the operation of each algorithm simultaneously as scatter plots on a screen split into two panels; and a learning unit that learns the influence of the algorithm and parameters specified by the selection unit from the results shown on the two panels of the visualization unit.

Description

Comparison learning device and method of clustering algorithm, and recording medium for performing the method {METHOD AND DEVICE FOR EDUCATING AND COMPARING CLUSTERING ALGORITHMS, RECORDING MEDIUM FOR PERFORMING THE METHOD}

The present invention relates to an apparatus and method for comparative learning of clustering algorithms and to a recording medium for performing the method, and more particularly, to an apparatus and method that simultaneously visualize the operation of several clustering algorithms on a single data set, thereby facilitating comparative analysis and learning of the algorithms.

A cluster is defined as a set of objects that lie close to one another, or as a region in which objects are densely distributed.

Cluster analysis is one of the core areas of data mining and consists of finding clusters in a given data set. Because it improves the efficiency of data analysis by finding clusters automatically, cluster analysis is frequently used in fields such as artificial intelligence and big data.

Clustering algorithms fall broadly into three categories: partitioning algorithms, hierarchical algorithms, and density-based algorithms.

A partitioning algorithm usually starts from an arbitrary initial partitioning and iteratively adjusts the partitioning so as to minimize a specific objective function.

A hierarchical algorithm generates a hierarchy of partitionings of the data set, which can be represented as a tree using a dendrogram.

Density-based algorithms are the most recently researched and developed clustering algorithms and have the advantage that they can find clusters regardless of cluster shape. Here, a cluster is defined as a region whose density is markedly higher than that of other regions; data in regions of sufficiently low density are treated as noise and do not belong to any cluster.

The technical problem of the present invention was conceived in this respect, and an object of the present invention is to provide an apparatus and method for comparative learning of clustering algorithms and a recording medium for performing the method.

The technical problems of the present invention are not limited to the technical problem mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

An apparatus for comparative learning of clustering algorithms according to an embodiment for realizing the above object of the present invention includes: a selection unit that selects one or two clustering algorithms and specifies the parameters each selected algorithm needs for analysis; a data generation unit that selects a data distribution, automatically generates data following the selected distribution, and adds the data to a data set; a cluster unit that analyzes the data set according to the algorithm and parameters specified by the selection unit to generate clusters and assigns a color to each generated cluster; a visualization unit that uses the colors assigned by the cluster unit to show the operation of each algorithm simultaneously as scatter plots on a screen split into two panels; and a learning unit that learns the influence of the algorithm and parameters specified by the selection unit from the results shown on the two panels of the visualization unit.

In an embodiment of the present invention, the clustering algorithms may include partitioning algorithms, hierarchical algorithms, and density-based algorithms.

In an embodiment of the present invention, when a clustering simulation is performed, the cluster unit may re-analyze the data according to the algorithm and parameters selected by the selection unit and regenerate the clusters each time the data generation unit adds data to the data set, and, when the cluster to which a point belongs changes as a result, may continuously change the color of that point to match its new cluster.

In an embodiment of the present invention, when the selection unit selects an algorithm belonging to the hierarchical algorithms, the visualization unit may display the result for the selected algorithm as a dendrogram plot.

In an embodiment of the present invention, when a clustering simulation is performed on a sample distribution data set whose cluster result is already known from a previous clustering simulation, the learning unit may compare the known cluster result with the cluster results of the simulation shown on the two panels of the visualization unit, present the difference as any one of a numerical value, a table, and a graph, and learn from it the influence of the algorithm and parameters specified by the selection unit.

In an embodiment of the present invention, the clustering algorithms are implemented in the Java language, and a new clustering algorithm may be added by inheriting from a previously implemented abstract class and overriding it.

A method for comparative learning of clustering algorithms according to an embodiment for realizing the above object of the present invention includes: selecting one or two clustering algorithms; selecting the parameters each selected algorithm needs for analysis; selecting a data distribution, automatically generating data following the selected distribution, and adding the data to a data set; analyzing the generated data set according to the selected algorithm and parameters to generate clusters and assigning a color to each generated cluster; showing the operation of each algorithm simultaneously on two panels as scatter plots drawn from the assigned colors and the point coordinates; and learning the influence of the selected algorithm and parameters from the results shown on the two panels.

In an embodiment of the present invention, in the step of assigning a color to each cluster, when a clustering simulation is performed, the data may be re-analyzed according to the selected algorithm and parameters and the clusters regenerated each time data is added, and, when the clusters to which points belong change as a result, the colors of those points may be continuously changed to match their new clusters.

In an embodiment of the present invention, in the step of simultaneously showing the operation of the algorithms, when an algorithm belonging to the hierarchical algorithms was selected in the step of selecting the algorithms, the result for the selected algorithm may be represented as a dendrogram plot.

In an embodiment of the present invention, in the step of learning the influence of the selected algorithm and parameters, when a clustering simulation is performed on a sample distribution data set whose cluster result is already known from a previous clustering simulation, the known cluster result may be compared with the cluster results of the simulation shown on the two panels, the difference may be presented as any one of a numerical value, a table, and a graph, and the influence of the selected algorithm and parameters may be learned from it.

A computer-readable storage medium according to an embodiment for realizing the above object of the present invention has recorded thereon a computer program for executing the method for comparative learning of clustering algorithms.

According to the aspect of the present invention described above, the apparatus and method for comparative learning of clustering algorithms visualize the execution and results of the clustering simulation on two panels simultaneously once the algorithms and parameters have been selected. This lets the user learn the influence of the algorithm and parameter choices through comparison, which raises learning efficiency, and it has the advantageous effect of enabling multi-objective optimization in which even a non-expert in cluster analysis can select an appropriate algorithm and parameters for a given data distribution.

The effects obtainable from the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the technical field to which the present invention belongs.

FIG. 1 is a schematic block diagram of an apparatus for comparative learning of clustering algorithms according to an embodiment of the present invention.
FIG. 2 is a scatter plot showing an example of applying different clustering algorithms to the same data set.
FIG. 3 is a dendrogram showing the result of an analysis by hierarchical clustering.
FIGS. 4 and 5 are diagrams showing examples of the results of running a comparison simulation of clustering algorithms according to the present invention.
FIG. 6 is a class diagram of the clustering algorithms.
FIG. 7 is a flowchart schematically illustrating a method for comparative learning of clustering algorithms according to an embodiment of the present invention.

The following detailed description of the present invention refers to the accompanying drawings, which illustrate by way of example specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable a person skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the present invention. In addition, it should be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the following detailed description is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a schematic block diagram of an apparatus for comparative learning of clustering algorithms according to an embodiment of the present invention.

Referring to FIG. 1, the apparatus 100 for comparative learning of clustering algorithms according to an embodiment of the present invention may include a selection unit 110, a data generation unit 120, a cluster unit 130, a visualization unit 140, and a learning unit 150. The apparatus 100 may be a separate terminal or a module forming part of a terminal. The selection unit 110, the data generation unit 120, the cluster unit 130, the visualization unit 140, and the learning unit 150 shown in FIG. 1 may be formed as an integrated module or as one or more modules. Conversely, each component may be formed as a separate module.

Software (an application) for performing the comparative learning of clustering algorithms may be installed and executed on the apparatus 100 of the present invention, and the selection unit 110, the data generation unit 120, the cluster unit 130, the visualization unit 140, and the learning unit 150 may be controlled by the software running on the apparatus 100.

The selection unit 110 allows the user to select one or two clustering algorithms to compare and to adjust the detailed parameters and execution speed of each algorithm. The two selections need not be different algorithms: the user can select the same algorithm twice with different parameter values to learn how the parameters affect the behavior of the algorithm.

Clustering algorithms are highly diverse; the present invention classifies them broadly into partitioning algorithms, density-based algorithms, and hierarchical algorithms.

A partitioning algorithm usually starts from an arbitrary initial partitioning and iteratively adjusts the partitioning so as to minimize a specific objective function. For example, k-means, a representative partitioning algorithm, aims to minimize the within-cluster variance, that is, the spread of the distances between each cluster center and the points assigned to it. The total number of partitions (clusters) must be fixed in advance, as with the value k in k-means.
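The iterative adjustment described above can be sketched roughly as follows. This is a minimal illustration of the standard k-means (Lloyd) iteration on 2-D points, not the code of the present invention; the class name `KMeansSketch`, the method signature, and the choice to seed the centroids with the first k points are all assumptions made for the example.

```java
import java.util.Arrays;

/** Illustrative k-means (Lloyd's iteration) on 2-D points; names are hypothetical. */
final class KMeansSketch {

    /** Returns the cluster index of each point after at most maxIter iterations. */
    static int[] cluster(double[][] points, int k, int maxIter) {
        int n = points.length;
        double[][] centroids = new double[k][];
        // Assumption: seed the centroids with the first k points (deterministic demo).
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();

        int[] label = new int[n];
        Arrays.fill(label, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (sqDist(points[i], centroids[c]) < sqDist(points[i], centroids[best])) best = c;
                }
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            if (!changed) break; // the partitioning is stable
            // Update step: move each centroid to the mean of its members.
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                sum[label[i]][0] += points[i][0];
                sum[label[i]][1] += points[i][1];
                count[label[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    centroids[c][0] = sum[c][0] / count[c];
                    centroids[c][1] = sum[c][1] / count[c];
                }
            }
        }
        return label;
    }

    static double sqDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {10, 10}, {0, 1}, {1, 0}, {10, 11}, {11, 10}};
        System.out.println(Arrays.toString(cluster(pts, 2, 100)));
    }
}
```

Each pass first reassigns points and then recomputes the centers, so the squared-distance objective never increases; the loop stops as soon as no point changes cluster.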

Density-based algorithms are the most recently researched and developed clustering algorithms and can find clusters regardless of cluster shape. Here, a cluster is defined as a region whose density is markedly higher than that of other regions; data in regions of sufficiently low density are treated as noise and do not belong to any cluster. For example, DBSCAN repeatedly links data points that satisfy a distance condition and a density condition (density-reachability), thereby finding dense regions and delimiting them as clusters.
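As a rough illustration of the linking process just described, the sketch below follows the commonly published form of DBSCAN with two parameters, a neighborhood radius and a minimum neighbor count. The names `DbscanSketch`, `eps`, and `minPts` are assumptions for the example, and the code is not taken from the present invention.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Illustrative DBSCAN on 2-D points; names and parameters are hypothetical. */
final class DbscanSketch {
    static final int UNVISITED = -2;
    static final int NOISE = -1;

    /** Returns a cluster id per point, or NOISE for points in sparse regions. */
    static int[] cluster(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length];
        Arrays.fill(label, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = neighbors(pts, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; } // not a core point
            label[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = clusterId; // border point joins the cluster
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> around = neighbors(pts, q, eps);
                if (around.size() >= minPts) queue.addAll(around); // q is core: keep expanding
            }
            clusterId++;
        }
        return label;
    }

    /** Indices of all points within eps of point i (including i itself). */
    static List<Integer> neighbors(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
            if (dx * dx + dy * dy <= eps * eps) out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}, {10, 11}, {11, 10}, {11, 11}, {50, 50}};
        System.out.println(Arrays.toString(cluster(pts, 1.5, 3)));
    }
}
```

Unlike k-means, the number of clusters is not given in advance: it falls out of how many separate dense regions the expansion discovers, and isolated points stay labeled as noise.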

FIG. 2 is a scatter plot showing an example of applying different clustering algorithms to the same data set.

Referring to FIG. 2, the left side shows the cluster result of the DBSCAN algorithm, a type of density-based algorithm, and the right side shows the cluster result of the k-means algorithm with k = 2.

A hierarchical algorithm generates a hierarchy of partitionings of the data set, which is represented as a dendrogram. A characteristic of the dendrogram is that the closer the common ancestor node of two leaf nodes is to the bottom level, the more similar the two data objects are and, from the standpoint of cluster analysis, the more likely they are to belong to the same cluster. Methods of generating a dendrogram with a hierarchical algorithm are broadly divided into bottom-up hierarchical agglomerative clustering (HAC) and top-down hierarchical divisive clustering (HDC). The time complexities of the naive approaches are O(n^3) and O(2^n), respectively, which are very high, but practical algorithms now exist.
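To make the bottom-up (HAC) case concrete, the following is a minimal sketch of a naive agglomerative procedure that records each merge and its height, which is the information a dendrogram draws. Single linkage (the minimum pairwise distance) is an assumption chosen for the example, as are the names `NaiveHacSketch`, `Merge`, and `dendrogram`; the source does not state which linkage is used.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative naive HAC with single linkage; names are hypothetical. */
final class NaiveHacSketch {

    /** One dendrogram node: two groups of point indices joined at a given height. */
    static final class Merge {
        final List<Integer> left;
        final List<Integer> right;
        final double height;

        Merge(List<Integer> left, List<Integer> right, double height) {
            this.left = left;
            this.right = right;
            this.height = height;
        }
    }

    /** Repeatedly merges the two closest clusters until one remains. */
    static List<Merge> dendrogram(double[][] pts) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        List<Merge> merges = new ArrayList<>();
        while (clusters.size() > 1) {
            int bestA = 0, bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            // Scan every pair of clusters for the smallest linkage distance.
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = singleLink(pts, clusters.get(a), clusters.get(b));
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            // Remove the higher index first so the lower index stays valid.
            List<Integer> right = clusters.remove(bestB);
            List<Integer> left = clusters.remove(bestA);
            merges.add(new Merge(left, right, best));
            List<Integer> joined = new ArrayList<>(left);
            joined.addAll(right);
            clusters.add(joined);
        }
        return merges;
    }

    /** Single linkage: the minimum Euclidean distance between members of a and b. */
    static double singleLink(double[][] pts, List<Integer> a, List<Integer> b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                min = Math.min(min, Math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 0}, {10, 3}};
        for (Merge m : dendrogram(pts)) {
            System.out.println(m.left + " + " + m.right + " at height " + m.height);
        }
    }
}
```

Each of the n - 1 merge rounds rescans all cluster pairs, and the pairwise distance work per round is bounded by n^2, which is where the cubic cost of the naive approach comes from.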

FIG. 3 is a dendrogram showing the result of an analysis by hierarchical clustering.

Referring to FIG. 3, the 50 states of the United States are classified according to regional characteristics, and it can be seen that the closer two states are in those characteristics, the nearer to the bottom level they are joined.

The data generation unit 120 automatically generates new data following a specific data distribution selected by the user and adds the data to the data set to be analyzed in the clustering simulation.
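A data generation step of this kind might look like the sketch below. The source does not list which distributions are offered, so the two options here (`UNIFORM` and `GAUSSIAN`), the unit-square coordinate range, and every name in the code are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative data generator; the distributions offered are assumed, not from the source. */
final class DataGeneratorSketch {

    /** Hypothetical choices a user could pick from a distribution list. */
    enum Distribution { UNIFORM, GAUSSIAN }

    /** Generates count new 2-D points following dist and appends them to dataSet. */
    static void generate(List<double[]> dataSet, Distribution dist, int count, Random rng) {
        for (int i = 0; i < count; i++) {
            double x;
            double y;
            if (dist == Distribution.GAUSSIAN) {
                // Assumed blob centered at (0.5, 0.5) with standard deviation 0.1.
                x = 0.5 + 0.1 * rng.nextGaussian();
                y = 0.5 + 0.1 * rng.nextGaussian();
            } else {
                // Uniform over the unit square [0, 1) x [0, 1).
                x = rng.nextDouble();
                y = rng.nextDouble();
            }
            dataSet.add(new double[] {x, y});
        }
    }

    public static void main(String[] args) {
        List<double[]> dataSet = new ArrayList<>();
        generate(dataSet, Distribution.GAUSSIAN, 100, new Random());
        System.out.println("points in data set: " + dataSet.size());
    }
}
```

Because new points are appended to the existing list rather than replacing it, repeated calls grow one shared data set, which matches the description of data being added to the set under analysis.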

The cluster unit 130 analyzes the data set generated by the data generation unit 120 on the basis of the algorithm and parameters selected by the selection unit 110 to generate clusters, and assigns a color to each cluster for identification.

During a clustering simulation, the points, which are shown in gray (uncolored) before the clustering analysis begins, change color continuously as the clustering proceeds. These color changes show in real time which cluster each point belongs to and make the overall clustering state visible at a glance.

In addition, each time the data generation unit adds data to the data set, the data are re-analyzed according to the algorithm and parameters selected by the selection unit and the clusters are regenerated; when the cluster to which a point belongs changes as a result, the color of the point is changed to match its new cluster.
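The coloring and re-clustering behavior described above can be sketched as follows. The palette values, the use of hex color strings, the convention that a negative label means "not yet clustered", and the names `ClusterColorSketch`, `colorsFor`, and `addAndRecolor` are all assumptions for the example; the clustering algorithm is passed in as a function so that any algorithm can be plugged in.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative color assignment and re-clustering on data addition; names are hypothetical. */
final class ClusterColorSketch {
    /** Color for points that do not belong to any cluster yet. */
    static final String GRAY = "#808080";
    /** Assumed palette; cluster ids beyond its length wrap around. */
    static final String[] PALETTE = {"#1f77b4", "#ffcc00", "#2ca02c", "#ff69b4", "#d62728"};

    /** Maps each cluster label to a color; negative labels stay gray. */
    static String[] colorsFor(int[] labels) {
        String[] colors = new String[labels.length];
        for (int i = 0; i < labels.length; i++) {
            colors[i] = labels[i] < 0 ? GRAY : PALETTE[labels[i] % PALETTE.length];
        }
        return colors;
    }

    /** Adds one point, re-runs the selected algorithm on the whole set, and recolors every point. */
    static String[] addAndRecolor(List<double[]> dataSet, double[] newPoint,
                                  Function<List<double[]>, int[]> algorithm) {
        dataSet.add(newPoint);
        return colorsFor(algorithm.apply(dataSet));
    }
}
```

Recomputing the colors for the whole data set after each addition, rather than only for the new point, is what lets existing points change color when the new point shifts the cluster boundaries.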

The visualization unit 140 visualizes the colors assigned according to the analysis by the cluster unit 130. The two panels, split left and right, show the different results obtained for each algorithm or parameter setting selected in the selection unit 110.

The default representation is a scatter plot in XY coordinates; however, when an algorithm belonging to the hierarchical algorithms is selected in the selection unit, the data analysis result for that algorithm may be shown as a dendrogram plot.

The learning unit 150 assists in learning the principle and operation of the algorithms by letting the user view the results shown on the two panels of the visualization unit and visually confirm the influence of the algorithm and parameters specified in the selection unit.

When the clustering simulation is performed on a sample distribution data set whose cluster result is already known from a previous clustering simulation, the known cluster result is compared with the cluster results of the current simulation shown on the two panels, and the similarities and differences between the classifications are displayed as any one of a numerical value, a table, and a graph. This allows the influence of the algorithm and parameters specified in the selection unit to be learned more efficiently.
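The source does not say which numerical value is used for this comparison. As one possible example, the sketch below computes the Rand index, a standard measure of agreement between two labelings that ignores how the clusters happen to be numbered; the choice of this measure and the name `RandIndexSketch` are assumptions.

```java
/** Illustrative comparison of a known labeling with a simulated one; the metric is an assumed choice. */
final class RandIndexSketch {

    /**
     * Rand index: the fraction of point pairs on which the two labelings agree,
     * that is, both put the pair in the same cluster or both put it in different clusters.
     */
    static double randIndex(int[] known, int[] simulated) {
        if (known.length != simulated.length) {
            throw new IllegalArgumentException("labelings must cover the same points");
        }
        long agree = 0;
        long pairs = 0;
        for (int i = 0; i < known.length; i++) {
            for (int j = i + 1; j < known.length; j++) {
                boolean sameKnown = known[i] == known[j];
                boolean sameSimulated = simulated[i] == simulated[j];
                if (sameKnown == sameSimulated) agree++;
                pairs++;
            }
        }
        return pairs == 0 ? 1.0 : (double) agree / pairs;
    }

    public static void main(String[] args) {
        int[] known = {0, 0, 1, 1};
        System.out.println(randIndex(known, new int[] {1, 1, 0, 0})); // same grouping, different ids
        System.out.println(randIndex(known, new int[] {0, 1, 1, 1})); // one point misplaced
    }
}
```

A pair-based measure is convenient here because the cluster ids (and therefore colors) produced by two runs need not match even when the groupings are identical.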

FIGS. 4 and 5 are diagrams showing examples of the results of running a comparison simulation of clustering algorithms.

FIG. 4 shows the result of applying two different algorithms to the same data set.

Each unit of the apparatus may be provided as a program interface. In the selection unit 110, an algorithm can be selected in each of the upper and lower fields, and the parameters each algorithm needs for the clustering analysis can be entered in the fields to the right (210).

In the data generation unit 120, when the user selects an entry from a predefined list of data distributions and presses the adjacent generate button, new data following the selected distribution are automatically generated and added to the data set (270).

Cluster analysis is performed on the data set, colors are assigned, and the results are visualized on the left and right panels. The left panel shows the clusters generated by the hierarchical agglomerative clustering (HAC) algorithm (250), and the right panel shows the clusters generated by k-means, one of the partitioning algorithms (230).

FIG. 5 shows the cluster results when the same algorithm, k-means, is selected for the same data set but with different parameters. The left panel (330) is the result with k set to 3, where the points are classified into three colors: yellow, red, and blue. The right panel (310) is the result with k set to 5, where the points are classified into five colors: blue, yellow, green, pink, and red.

FIG. 6 is a class diagram of the clustering algorithms.

Referring to FIG. 6, the algorithms are implemented in Java, and the diagram shows the inheritance relationship between the core abstract class (410) and the individual clustering algorithms (430, 450, and 470). The core class of the clustering algorithm implementation is the abstract class Clustering (410). By inheriting from this class and overriding only the cluster() function, specific clustering algorithms such as naiveHAC (430), K-means (450), and DBSCAN (470) can be added conveniently, which broadens the range of algorithms that can be simulated.
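The extension mechanism can be illustrated with the minimal sketch below. Only the names `Clustering` and `cluster()` come from the description; the method signature, the return convention (one cluster index per point), and the toy subclass `ThresholdClustering` are assumptions added for illustration and are not one of the algorithms in FIG. 6.

```java
import java.util.Arrays;
import java.util.List;

/** Assumed shape of the core abstract class; only the names Clustering and cluster() are from the text. */
abstract class Clustering {
    /** Returns one cluster index per point (hypothetical signature). */
    abstract int[] cluster(List<double[]> points);
}

/** Toy algorithm, added purely for illustration: splits points at a fixed x threshold. */
final class ThresholdClustering extends Clustering {
    private final double threshold;

    ThresholdClustering(double threshold) {
        this.threshold = threshold;
    }

    @Override
    int[] cluster(List<double[]> points) {
        int[] labels = new int[points.size()];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = points.get(i)[0] < threshold ? 0 : 1;
        }
        return labels;
    }
}

/** Shows two panels driven through the same abstract type. */
final class ClusteringDemo {
    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(new double[] {1, 0}, new double[] {4, 0}, new double[] {9, 0});
        // Same algorithm with different parameters, as in a side-by-side comparison.
        Clustering[] panels = {new ThresholdClustering(3.0), new ThresholdClustering(5.0)};
        for (Clustering panel : panels) {
            System.out.println(Arrays.toString(panel.cluster(points)));
        }
    }
}
```

Because the caller depends only on the abstract type, a new algorithm needs nothing beyond its own subclass to appear in either panel.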

FIG. 7 is a flowchart schematically illustrating a method for comparative learning of clustering algorithms according to an embodiment of the present invention.

The user selects one or two clustering algorithms (S110) and may adjust the detailed parameters needed for analysis and the execution speed of each selected algorithm (S120).

New data following a specific data distribution selected by the user are automatically generated and added to the data set to be analyzed in the clustering simulation (S130).

The data set is analyzed on the basis of the selected algorithm and parameters to generate clusters, and a color is assigned to each cluster to make it easy to identify visually (S140).

The data set, classified by the assigned colors, is shown on two panels split left and right, each displaying the result for its algorithm or parameter setting (S150). The default representation is a scatter plot in XY coordinates; however, when an algorithm belonging to the hierarchical algorithms is selected, the data analysis result for that algorithm may be shown as a dendrogram plot.

By viewing the results displayed on the two screens, the user can visually confirm the influence of the algorithm and its parameters, which assists in learning the principle and operation of the algorithm (S160).
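Alongside the visual comparison, the claims describe displaying the difference between cluster results as numerical values. The source does not name a particular measure. One common choice, used here purely as an illustration, is the Rand index: the fraction of point pairs on which two clusterings agree about whether the pair belongs together. The ClusterComparison class and randIndex() method are hypothetical names.

```java
/** Sketch of a numeric comparison between two clusterings of the same points. */
class ClusterComparison {
    /**
     * Rand index in [0, 1]: 1.0 means the two labelings group the points
     * identically, regardless of which label numbers are used.
     */
    static double randIndex(int[] first, int[] second) {
        if (first.length != second.length || first.length < 2) {
            throw new IllegalArgumentException("need two labelings of the same >= 2 points");
        }
        long agreeing = 0;
        long pairs = 0;
        for (int i = 0; i < first.length; i++) {
            for (int j = i + 1; j < first.length; j++) {
                boolean togetherInFirst = first[i] == first[j];
                boolean togetherInSecond = second[i] == second[j];
                if (togetherInFirst == togetherInSecond) {
                    agreeing++;
                }
                pairs++;
            }
        }
        return (double) agreeing / pairs;
    }
}
```

Because the measure looks only at pairs, it is unaffected when two algorithms number the same clusters differently, which is the usual situation when comparing the left and right screens.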

The comparative learning method for clustering algorithms described above may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in the computer software field.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: apparatus for comparative learning of clustering algorithms
110: selection unit
120: data generation unit
130: cluster unit
140: visualization unit
150: learning unit

Claims

An apparatus for comparative learning of clustering algorithms, comprising:
a selection unit configured to select one or two algorithms from among clustering algorithms and to designate parameters required for analysis by the selected algorithm;
a data generation unit configured to select a data distribution, automatically generate data following the selected distribution, and add the data to a data set;
a cluster unit configured to generate clusters by analyzing the data set according to the algorithm and parameters designated by the selection unit, and to assign a color to each generated cluster;
a visualization unit configured to simultaneously display the operation process of each algorithm on a screen divided into two, using scatter plots in the colors assigned by the cluster unit; and
a learning unit configured, in order to display the influence of the algorithm and parameters designated by the selection unit from the results shown on the two screens of the visualization unit, when a clustering simulation is performed on a loaded sample distribution data set whose cluster result is already known from a prior clustering simulation, to compare the known cluster result with the simulated cluster results shown on the two screens of the visualization unit and to display the difference as any one of numerical values, a table, and a graph,
wherein the selection unit designates different parameter values for the same algorithm.

The apparatus of claim 1, wherein the clustering algorithms include partitioning algorithms, hierarchical algorithms, and density-based algorithms.

The apparatus of claim 1, wherein, when a clustering simulation is performed, the cluster unit regenerates the clusters by analyzing the data according to the algorithm and parameters selected by the selection unit each time data is added to the data set by the data generation unit, and, when the cluster to which a point belongs changes, continuously changes the color of that point to match the changed cluster.

The apparatus of claim 1, wherein, when an algorithm belonging to the hierarchical algorithms is selected in the selection unit, the visualization unit displays a dendrogram plot for the selected algorithm.

(Deleted)

The apparatus of claim 1, wherein the clustering algorithms are implemented in the Java language and can be added by inheriting from and overriding a previously implemented abstract class.

A method for comparative learning of clustering algorithms, performed by a comparative learning apparatus, the method comprising:
selecting one or two algorithms from among clustering algorithms;
selecting parameters required for analysis by the selected algorithm;
selecting a data distribution, automatically generating data following the selected distribution, and adding the data to a data set;
generating clusters by analyzing the data of the generated data set according to the selected algorithm and parameters, and assigning a color to each generated cluster;
simultaneously displaying the operation process of each algorithm on two screens as scatter plots, using the assigned colors and coordinates; and
in order to display the influence of the selected algorithm and parameters from the results shown on the two screens, when a clustering simulation is performed on a loaded sample distribution data set whose cluster result is already known from a prior clustering simulation, comparing the known cluster result with the simulated cluster results shown on the two screens and displaying the influence of the selected algorithm and parameters as any one of numerical values, a table, and a graph,
wherein the selecting of parameters includes selecting different parameter values for the same algorithm.

The method of claim 7, wherein the assigning of a color to each generated cluster comprises, when a clustering simulation is performed, regenerating the clusters by analyzing the data according to the selected algorithm and parameters each time data is added, and, when the cluster to which a point belongs changes as a result of the regeneration, continuously changing the color of that point to match the changed cluster.

The method of claim 7, wherein the simultaneous displaying of the operation process of each algorithm comprises, when an algorithm belonging to the hierarchical algorithms is selected in the selecting of the algorithms, displaying the result for the selected algorithm as a dendrogram plot.

(Deleted)

A computer-readable storage medium having a computer program recorded thereon for executing the method for comparative learning of the clustering algorithm according to any one of claims 7 to 9.