KR101577249B1 - Device and method for voronoi cell-based support clustering - Google Patents

Device and method for voronoi cell-based support clustering

Info

Publication number
KR101577249B1
KR101577249B1
Authority
KR
South Korea
Prior art keywords
data
cluster
clustering
representative point
representative
Prior art date
Application number
KR1020140031027A
Other languages
Korean (ko)
Other versions
KR20150108173A (en)
Inventor
이재욱
손영두
이수지
김경옥
Original Assignee
서울대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울대학교산학협력단 filed Critical 서울대학교산학협력단
Priority to KR1020140031027A
Publication of KR20150108173A
Application granted
Publication of KR101577249B1

Landscapes

  • Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

A clustering apparatus includes a cluster definition unit that extracts sample data from a data set stored in a data storage and defines clusters from the extracted sample data, and a cluster allocation unit that allocates each data item of the data set to one of the defined clusters. The cluster definition unit computes representative points usable as seeds of Voronoi cells and defines the clusters by labeling the representative points.

Description

TECHNICAL FIELD [0001] The present invention relates to a Voronoi cell-based support clustering apparatus and method.

The present invention relates to a clustering apparatus and method.

Clustering groups similar or related items into a plurality of groups (hereinafter, clusters), as shown in the space of the example of FIG. 5. In other words, clustering assigns objects whose characteristics are similar to the same cluster, while different clusters contain dissimilar data.

Cluster analysis is used to characterize data and is applied in a wide variety of fields such as machine learning, image analysis, information retrieval, pattern recognition, and data analysis. Accordingly, various clustering methods have been developed.

Among them, kernel support clustering, that is, support-based clustering using a kernel, can create complex cluster boundaries compared with other clustering methods and handles outlier data well, so it has been widely studied. The inventor of the present invention has also filed a patent for a basin-cell-based clustering technique using a dynamical system (Application No. 10-2007-00844468).

However, such conventional support-based clustering has the problem that the computational load is too high, especially when processing large amounts of data. Therefore, a clustering apparatus and method that can define complex clusters with high accuracy at a low computational load is needed.

Korean Patent No. 10-1024038 ("Cluster configuration method of cluster sensor network and sensor network to which the method is applied") discloses a configuration for clustering sensor nodes of a sensor network in connection with the present invention.

In addition, US Patent Application Publication No. US-A-2010-0036647 (" Efficient computation of Voronoi diagrams of general generators in general spaces and uses thereof ") discloses a configuration for generating a Voronoi diagram.

SUMMARY OF THE INVENTION It is an object of the present invention to provide a clustering apparatus and method with a low computational load and high accuracy.

According to an aspect of the present invention, there is provided a clustering apparatus including: a cluster definition unit that extracts sample data from a data set stored in a data storage and defines clusters from the extracted sample data; and a cluster allocation unit that allocates each data item of the data set to one of the defined clusters, wherein the cluster definition unit computes representative points usable as seeds of Voronoi cells and defines the clusters by labeling the representative points.

According to a second aspect of the present invention, there is provided a clustering method using a clustering apparatus, comprising: a cluster definition step of extracting sample data from a data set stored in a data storage and defining clusters from the extracted sample data; and a cluster allocation step of allocating each data item of the data set to one of the defined clusters, wherein the cluster definition step computes representative points usable as seeds of Voronoi cells and defines each of the clusters by labeling the representative points.

The present invention obtains the effect that the computation load is low and the accuracy is high in the clustering apparatus and method.

The computational load can be greatly reduced by using Voronoi cells instead of basin-cell clustering, which has a high computational complexity.

In addition, by using basin-cell-based clustering only for the data that require it, it is possible to define complex clusters and maintain the advantage of accurate clustering.

FIG. 1 shows a structure of a clustering apparatus according to an embodiment of the present invention.
FIG. 2 shows a flow of a clustering method according to an embodiment of the present invention.
FIG. 3 illustrates the flow of the cluster definition step according to an embodiment of the present invention.
FIG. 4 illustrates the flow of the cluster allocation step according to an embodiment of the present invention.
FIG. 5 shows an example of clustering.
FIG. 6 shows a comparison of basin cells and Voronoi cells.
FIG. 7 shows the concept of a basin cell.
FIG. 8 shows the concept of a Voronoi cell.
FIG. 9 shows a subgraph.
FIG. 10 conceptually shows an embodiment using a Voronoi cell instead of a basin cell.
FIGS. 11 to 15 show experimental results verifying the effect of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can readily practice the invention. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. In order to clearly illustrate the present invention, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.

Throughout the specification, when a part is referred to as being "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another part in between. Also, when a part is said to "comprise" an element, this means that it may include other elements as well, rather than excluding them, unless specifically stated otherwise.

Hereinafter, the structure of the clustering apparatus according to an embodiment of the present invention and the flow of the clustering method according to an embodiment of the present invention will be briefly described with reference to the structural view of FIG. 1 and the flowcharts of FIGS. 2 to 4. Then, with reference to FIGS. 5 through 10, the concept of Voronoi-cell-based clustering according to an embodiment of the present invention will be described in detail, followed by the specific mathematical expressions and algorithms.

FIG. 1 shows a structure of a clustering apparatus according to an embodiment of the present invention.

The clustering apparatus 10 according to an embodiment of the present invention includes a cluster defining unit 100, a cluster allocating unit 200, and a data storage 300.

The clustering apparatus 10 according to an embodiment of the present invention uses a kernel support clustering method based on Voronoi cells. The kernel support clustering method is divided into two stages. The first stage creates a support function representing the boundaries of the clusters, and the second stage is a labeling step that assigns each data item to a cluster through the support function.

The two steps are performed by the cluster defining unit 100 and the cluster allocating unit 200 of the clustering apparatus 10 according to an embodiment of the present invention. That is, the cluster defining unit 100 creates support functions to define each cluster, and the cluster allocating unit 200 allocates (labels) the data to one of the defined clusters.

The data to be clustered and the clustering result may be stored in the data storage 300. For example, the cluster defining unit 100 can define clusters using sample data extracted from the data stored in the data storage 300, the cluster allocating unit 200 can allocate the data stored in the data storage 300 to the defined clusters, and the allocation results may also be stored in the data storage 300.

Defining clusters with basin cells and labeling the data can classify data accurately into complex clusters, but the computational load is high. Therefore, the cluster defining unit 100 and the cluster allocating unit 200 according to an embodiment of the present invention define approximate cluster boundaries using Voronoi cells, label most data with simple operations against these Voronoi-cell boundaries, and fall back to the labeling operation of basin-cell-based clustering (e.g., running the dynamical system) only for data that cannot be labeled this way.

FIG. 2 shows a flow of a clustering method according to an embodiment of the present invention.

First, clusters are defined by dividing the sample space (S100). The cluster defining unit 100 extracts sample data from the data set stored in the data storage 300, creates a support function indicating the cluster boundaries, and divides the space of the sample data into a plurality of clusters.

Next, data is allocated to each cluster (S200). That is, the data stored in the data repository 300 is labeled through the support function.

FIGS. 3 to 4 illustrate the flow of each step of the clustering method according to an embodiment of the present invention. As described above, the flow of each step will be briefly described and the details will be described later.

Figure 3 illustrates the flow of clustering definition steps in accordance with an embodiment of the present invention.

In the cluster definition (S100), first, a data sample is extracted (S110).

Next, the support function and the optimum value are calculated (S120). The support function is conceptually defined as a contour line when the data is expressed in the feature space, as will be described later.

Next, a dynamic system is constructed using the support function and a representative point is calculated (S130). The representative point is used as a seed point of the Voronoi cell, which will be described later.

Next, a weighted graph is constructed using representative points (S140).

Next, the weighted graph is converted into subgraphs (S150). Edges whose weights are below the threshold value are removed, dividing the weighted graph into a plurality of subgraphs.

Next, the cluster definition is completed by labeling each representative point using the computed subgraphs (S160).
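Steps S140 to S160 above can be sketched as follows: build a weighted graph over the representative points, cut the low-weight edges, and label each representative point by the subgraph (connected component) it falls into. This is a minimal illustration; the representative points, weights, and threshold here are invented, whereas in the patent the weights come from the support function.

```python
# Sketch of S140-S160: weighted graph over representative points (REPs),
# low-weight edges removed (S150), REPs labeled by connected component (S160).
# All names and numeric values below are illustrative, not from the patent.

def label_representatives(n_reps, weighted_edges, threshold):
    """weighted_edges: list of (i, j, w). Edges with w < threshold are cut."""
    # Build the adjacency of the subgraph after cutting low-weight edges.
    adj = {i: [] for i in range(n_reps)}
    for i, j, w in weighted_edges:
        if w >= threshold:
            adj[i].append(j)
            adj[j].append(i)
    # Label each REP by the connected component it belongs to.
    labels = [-1] * n_reps
    cluster = 0
    for start in range(n_reps):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = cluster
                stack.extend(adj[node])
        cluster += 1
    return labels

# Four REPs; the weak edge (1, 2) is cut, splitting the graph in two.
edges = [(0, 1, 0.9), (1, 2, 0.1), (2, 3, 0.8)]
print(label_representatives(4, edges, threshold=0.5))  # [0, 0, 1, 1]
```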

FIG. 4 illustrates a flow of the clustering allocation step according to an embodiment of the present invention.

The cluster allocation (S200) repeats the following steps for each data item until labeling is complete (S260).

First, the nearest and second-nearest representative points to the data are computed (S220).

Next, it is determined whether the ratio of the distances from the data point to the two representative points is above a certain level, or whether the two representative points belong to the same cluster (S230). As described above, this step determines whether labeling will be performed using the Voronoi cell or using the dynamical system of the basin-cell-based clustering technique.

When the Voronoi cell can be used, the data is allocated to the cluster of the closest representative point (S240).

If the Voronoi cell cannot be used, the representative point is computed using the dynamical system, and the data is assigned to that representative point's cluster (S250).
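The allocation flow S220 to S250 might be sketched as below. The ratio threshold, its direction, and the fallback routine are assumptions for illustration: in the patent the fallback runs the dynamical system of the basin-cell method, which is stood in for here by a caller-supplied `slow_labeler`.

```python
import math

def assign_cluster(point, reps, rep_labels, ratio_threshold, slow_labeler):
    """Sketch of S220-S250. reps: REP coordinates; rep_labels: their cluster
    labels; slow_labeler: fallback used when the Voronoi shortcut is unsafe
    (a placeholder for the patent's dynamical-system labeling)."""
    dist = lambda p, q: math.dist(p, q)
    order = sorted(range(len(reps)), key=lambda i: dist(point, reps[i]))
    nearest, second = order[0], order[1]                      # S220
    d1, d2 = dist(point, reps[nearest]), dist(point, reps[second])
    # S230: the shortcut is safe if both REPs share a cluster anyway, or the
    # nearest REP is unambiguous (assumed test: d2/d1 above a threshold).
    if rep_labels[nearest] == rep_labels[second] or d2 / max(d1, 1e-12) >= ratio_threshold:
        return rep_labels[nearest]                            # S240: fast path
    return slow_labeler(point)                                # S250: fallback

reps = [(0.0, 0.0), (4.0, 0.0)]
labels = [0, 1]
# A point clearly closest to the first REP takes the Voronoi fast path.
print(assign_cluster((0.5, 0.2), reps, labels, 2.0, slow_labeler=lambda p: -1))  # 0
```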

Referring to FIGS. 5 through 10, the concept of Voronoi-cell-based clustering according to an embodiment of the present invention will now be described in detail.

Figure 5 shows an embodiment of clustering.

FIG. 5-(a) shows the data stored in the data storage 300 plotted in a space, and FIG. 5-(b) shows the result of classifying the data into a plurality of clusters. As shown, the data are classified into clusters according to their distances in the space.

Therefore, clustering can be seen as dividing the space into multiple clusters. The inventor of the present invention has already applied for a patent that defines cluster boundaries using the concept of the basin cell (Application No. 10-2007-00844468). However, since that approach has the problem of a high computational load as described above, the present invention simplifies it using Voronoi cells.

FIG. 6 shows a comparison between basin cells and Voronoi cells, FIG. 7 shows the concept of a basin cell, and FIG. 8 shows the concept of a Voronoi cell.

Referring to FIG. 7, the concept of a basin cell is as follows. A stable point at which the gradient of a function becomes zero is called a stable equilibrium vector (SEV) or stable equilibrium point (SEP; hereinafter, equilibrium point or SEP). As can be seen, the basin cell is the region that converges to the equilibrium point. Therefore, each equilibrium point can represent its basin cell.

Referring to FIG. 8, Voronoi tessellation is a method of dividing a space using a set of points called seeds. The region of points closest to a given seed is called a Voronoi cell, and the diagram obtained by dividing the space into Voronoi cells as shown in FIG. 8 is called a Voronoi diagram. Since the points included in a Voronoi cell are closer to that cell's seed than to the seeds of the other cells, the Voronoi cell technique is well suited for clustering.

In this specification, the seed of each Voronoi cell is defined as a representative equilibrium point (REP). This point is also an equilibrium point in the sense of FIG. 7. That is, the clustering apparatus 10 according to an embodiment of the present invention uses the equilibrium point representing each basin cell as the representative point (seed) of each Voronoi cell.

FIG. 6 shows basin cells (curves) and Voronoi cells (straight lines) for well-known two-dimensional artificial data sets: (a) 2D-N200, (b) Four-Gaussian, (c) Two-circles, and (d) Three-circles. As shown, clusters can be defined such that the Voronoi cell boundaries lie very close to the basin cell boundaries.

However, the Voronoi cells and basin cells do not coincide exactly, because the bridging point of two adjacent representative points may not lie exactly on the basin cell boundary. Therefore, as described above, the cluster allocation step S200 according to an embodiment of the present invention includes a step S230 of determining whether data can be labeled using the Voronoi cell. The detailed concept is described with reference to FIG. 10.

FIG. 9 shows a subgraph.

As shown in the figure, a weighted graph is constructed by assigning weights to the edges between representative points; after removing the edges whose weights are below the reference value (e.g., edge S3), the graph is divided into a plurality of subgraphs. Each subgraph constitutes one cluster. The concepts of the weighted graph and the subgraph are discussed in the inventor's earlier application on basin-cell clustering mentioned above, so a detailed description is omitted here.

FIG. 10 conceptually shows an embodiment using a Voronoi cell instead of a basin cell.

The figure shows a Voronoi cell corresponding to one basin cell (sharing the same representative point R), in order to describe the contents of FIG. 6 more specifically. Since the basin cell C1 is bounded by a curve as shown in the drawing, the data can be labeled accurately even when the shape of the cluster is complicated. On the other hand, since the Voronoi cell C2 simplifies the boundary to straight lines, the operation of defining a cluster and the operation of assigning data to the cluster are simplified.

In (c), since the point P1 lies in both the Voronoi cell and the basin cell, it is correctly labeled as belonging to the cluster even when the Voronoi cell is used instead of the basin cell. However, the point P2 lies outside the Voronoi cell but inside the basin cell; if the Voronoi cell were used instead of the basin cell, P2 would not be labeled as belonging to the cluster. Therefore, for such points the basin-cell method must be used.

Therefore, after computing the representative point R nearest to the data point P2 (S220), it is determined whether the ratio of the distances from the data point to the two nearest representative points is above a certain level (S230).

The basic concepts underlying Voronoi-cell-based clustering according to an embodiment of the present invention have been described above. Now, the specific formulas and algorithms of Voronoi-cell-based clustering according to an embodiment of the present invention will be described in detail.

As described above, the kernel support clustering method is roughly divided into two steps. The first step creates a support function indicating the boundaries of the clusters, and the second step is a labeling step that assigns each data item to a cluster through the support function.

1) Support function

A support function \(f: \mathbb{R}^n \to \mathbb{R}_+\) maps n-dimensional data to positive real numbers and measures the support of the data distribution. Its level set is divided into m connected subsets as follows:

\(L_r = \{x : f(x) \le r\} = C_1 \cup \cdots \cup C_m\)   (EQ-1-1)

Here \(C_1, \ldots, C_m\) are the connected subsets, whose number m is determined as the number of clusters. As described above, if FIG. 5-(a) is the original data, FIG. 5-(b) shows the result divided into m clusters.

The support function can be generated by the SVDD (Support Vector Domain Description) method. The SVDD method maps data points into a high-dimensional feature space and finds the minimum-radius sphere containing most of the data in this feature space. When the sphere thus found is mapped back into the data space, it splits into m closed sets representing the clusters. With a kernel \(K\), the kernel support function trained by the SVDD method is

\(f(x) = K(x,x) - 2\sum_i \beta_i K(x_i, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)\)   (EQ-1-2)

where the \(x_i\) are the support vectors and the \(\beta_i\) their coefficients. The method used in the present invention is not limited to SVDD and can be applied with any method that learns a support function empirically from data.
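A direct transcription of the support function form (EQ-1-2) with a Gaussian kernel might look as follows. The support vectors and coefficients below are made-up numbers, not a trained SVDD model; the sketch only shows how the function is evaluated.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def support_function(x, svs, betas, sigma=1.0):
    """Support function of the SVDD form (EQ-1-2):
    f(x) = K(x,x) - 2 sum_i b_i K(x_i,x) + sum_ij b_i b_j K(x_i,x_j)."""
    k_xx = gaussian_kernel(x, x, sigma)          # equals 1 for a Gaussian kernel
    cross = sum(b * gaussian_kernel(sv, x, sigma) for sv, b in zip(svs, betas))
    const = sum(bi * bj * gaussian_kernel(si, sj, sigma)
                for si, bi in zip(svs, betas) for sj, bj in zip(svs, betas))
    return k_xx - 2 * cross + const

# Toy, untrained example: two "support vectors" with equal coefficients.
svs, betas = [(0.0, 0.0), (2.0, 0.0)], [0.5, 0.5]
inside = support_function((1.0, 0.0), svs, betas)
far = support_function((10.0, 0.0), svs, betas)
print(inside < far)  # True: f is small near the data, large far away
```

Points inside the support region get small function values, which is exactly what the level set \(L_r\) of (EQ-1-1) exploits.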

2) After obtaining the support function comes the labeling step. There are several methods for support clustering labeling. Proximity-graph-based methods include the Delaunay diagram (DD), minimum spanning tree (MST), and K-nearest neighbor (KNN). However, MST and KNN are likely to miss important contours and thus to produce clusters different from the actual boundaries. To solve this problem, an equilibrium-vector-based clustering (EVC) method using topological features of the trained support function has been proposed.

The EVC consists of two steps. The first step uses the support function f to build the associated dynamical system:

\(\frac{dx}{dt} = -E(x)\,\nabla f(x)\)   (EQ-1-3)

where \(E(x)\) is a positive definite symmetric matrix for all data \(x\). For the dynamical system (EQ-1-3), a unique solution \(x(t)\) with respect to the time t and the starting point x exists when f is twice differentiable and the norm of its gradient is bounded. A state vector \(s\) satisfying \(\nabla f(s) = 0\) is an equilibrium vector of the system (EQ-1-3). If the Jacobian matrix of the system at \(s\) has no eigenvalue with zero real part, \(s\) is called hyperbolic. A hyperbolic equilibrium vector \(s\) is (i) a stable equilibrium vector (SEV) if all eigenvalues of the Hessian of f at s are positive, and (ii) an unstable equilibrium vector (UEV) otherwise. An SEV is a local minimum of the support function f.
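Since an SEV is a local minimum of f, following the dynamical system (EQ-1-3) with \(E(x)\) taken as the identity amounts to gradient descent on f. The sketch below uses explicit Euler steps and a central-difference gradient; the test function (a quadratic whose only SEV is the origin) is an invented stand-in for a trained support function.

```python
def find_sev(f, x0, step=0.1, iters=500, eps=1e-6):
    """Follow dx/dt = -grad f(x) by explicit Euler steps; the trajectory
    converges to the SEV (local minimum) whose basin contains x0.
    The gradient is approximated by central differences."""
    x = list(x0)
    for _ in range(iters):
        grad = []
        for d in range(len(x)):
            xp, xm = list(x), list(x)
            xp[d] += eps
            xm[d] -= eps
            grad.append((f(xp) - f(xm)) / (2 * eps))
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x

# Toy function with a single SEV at the origin (not a trained support function).
f = lambda x: x[0] ** 2 + x[1] ** 2
sev = find_sev(f, (1.0, -0.7))
print(all(abs(c) < 1e-3 for c in sev))  # True
```

In the patent's setting, starting this process from each sampled data point identifies the SEVs (the representative points), since every point converges to the SEV of its basin.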

An important concept in EVC for inductive learning is the basin cell associated with (EQ-1-3). The basin of attraction of an SEV s is the set of all points that converge to s as the dynamical process proceeds:

\(A(s) = \{x : \lim_{t\to\infty} x(t) = s\}\)

The basin cell \(\bar{A}(s)\) of an SEV s is defined as the closure of its basin of attraction:

\(\bar{A}(s) = \mathrm{cl}(A(s))\)

The boundary of the basin cell, denoted \(\partial \bar{A}(s)\), is called the basin cell boundary. One of the good features of the basin cell is that, under certain conditions, the entire data space is partitioned into the basin cells of several SEVs, as in the following equation:

\(\mathbb{R}^n = \bigcup_{s_i \in S} \bar{A}(s_i)\)   (EQ-1-4)

where S is the set of SEVs of the dynamical system (EQ-1-3). Therefore, the entire data space can be divided into basin cells. Since all data points converge to a specific SEV under the dynamical process described above, the basin cells can be identified by finding the SEVs. Then, in the second step, labels are assigned using the adjacency matrix of the SEVs, or the entire data space is labeled using transition equilibrium vectors (TEVs).

This method also extends each cluster \(C_i\) of (EQ-1-1) to an expanded cluster as follows:

\(C_i^{ext} = \bigcup_{s_k \in C_i} \bar{A}(s_k)\)   (EQ-1-5)

where the union is over the SEVs \(s_k\) belonging to cluster \(C_i\). As a result, the entire data space can be divided into the expanded clusters, as in the following equation:

\(\mathbb{R}^n = \bigcup_{i=1}^{m} C_i^{ext}\)   (EQ-1-6)

Thus, bounded support vectors located outside the cluster boundaries, as well as data not used for training, can be given inductive cluster labels.

The SMO algorithm, widely used for solving the quadratic programming problem in support-based clustering, has a time complexity that grows with the number of data points N. The time complexity of cluster labeling, on the other hand, grows with the number of support vectors constituting the support function and with m, the number of evaluations of each kernel function.

To speed up cluster labeling and reduce the cost of support estimation, we use two main approximation methods for fast computation. 1) First, the support estimate of the entire data is approximated by the support estimate obtained from a small data sample. 2) Second, the cluster boundaries partitioned using basin cells are approximated by Voronoi cells. The Voronoi cell assigns each data point to the nearest representative point without an explicit call to the kernel support function. The following sections describe the details of this method.

<Sampling and support level function construction>

The support level function is a positive scalar function \(f: \mathbb{R}^n \to \mathbb{R}_+\). Its level set \(L_r = \{x : f(x) \le r\}\) estimates the support domain of the data distribution (the domain surrounding most of the data points). This is divided into m connected subsets as follows:

\(L_r = C_1 \cup \cdots \cup C_m\)   (EQ-2-1)

In general, support level functions can be obtained by the SVDD (Support Vector Domain Description) method, Gaussian process clustering, or other kernel-based distribution function estimation. In the present invention, SVDD is used to estimate the support function; nevertheless, the method can easily be applied to any support function. Given a data set \(\{x_1, \ldots, x_N\}\), the SVDD method maps the data points into a high-dimensional feature space and finds the minimum-radius sphere containing most of the data in this feature space. When mapped back into the data space, this sphere splits into m closed sets representing the respective clusters. With a Gaussian kernel \(K\), the kernel support function trained by the SVDD method is

\(f(x) = K(x,x) - 2\sum_i \beta_i K(x_i, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)\)   (EQ-2-2)

where the \(x_i\) are the support vectors and the \(\beta_i\) their coefficients.

In many real-world large data sets, most of the data is concentrated in a relatively narrow region, and a very small fraction of the data can describe the distribution of the entire data set, particularly its support, quite well. We now justify the use of sampled data points and explain how the trained Gaussian kernel support function represents the support of the class-conditional distribution. For this, a generalization error bound is applied.

Result 1: Let \(\{x_1, \ldots, x_N\}\) be iid (independently and identically distributed) samples from a probability distribution P containing no discrete components. For the trained Gaussian kernel support function given by (EQ-2-2), let \(L_r\) denote the level set derived from the level value r. Then, with high probability over the samples, the probability mass falling outside \(L_r\) is bounded as in (EQ-2-3); and when the number of support vectors of the function is small relative to the sample size, the simpler bound (EQ-2-4) follows.

This result indicates that when the right-hand side of (EQ-2-4) is sufficiently small for the purpose of the present invention, which is the case for large data sets where the sample size n satisfies n << N, sample data of size n can sufficiently capture the support of the data distribution.

<Sample space division>

After a support level function has been created from a small sampled fraction of the entire large data set, the next step is to assign cluster labels to the sampled data and to the remaining data (or unseen test data). This process can be done with the inductive basin-cell-based labeling method, because it divides the entire data space into basin cells and thereby naturally extends the clusters it produces.

One important drawback of this labeling process, however, is that it can be very costly, because it evaluates gradients of the trained support level function, whose complexity grows with the number of support vectors. To overcome this drawback and to simplify the labeling process, the present invention uses a new labeling method that does not require evaluating the function.

A state vector \(s\) of the function \(f\) satisfying \(\nabla f(s) = 0\) is defined as a representative equilibrium point (REP); such a point is also a stable equilibrium point (SEP). The core idea of the present invention is to determine the cluster of each data point by assigning it the cluster label of its nearest REP. Define the Voronoi cell of an REP s as the closure of the set of all points nearest to s:

\(V(s) = \mathrm{cl}\{x : \|x - s\| \le \|x - s'\| \text{ for every other REP } s'\}\)   (EQ-2-5)

All of the data in \(V(s)\) is assigned to the same cluster as s. The method then separates the entire data space into the Voronoi cells of the REPs as follows (this is analogous to the way the entire data space is separated into basin cells in the existing basin-cell-based method):

\(\mathbb{R}^n = \bigcup_{s_i} V(s_i)\)   (EQ-2-6)
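The partition (EQ-2-6) amounts to a plain nearest-REP assignment, which needs no call to the support function. A small sketch with invented coordinates:

```python
import math

def voronoi_partition(data, reps):
    """Assign every point to the Voronoi cell (EQ-2-5) of its nearest REP,
    realizing the partition of the data space in (EQ-2-6)."""
    cells = {i: [] for i in range(len(reps))}
    for x in data:
        nearest = min(range(len(reps)), key=lambda i: math.dist(x, reps[i]))
        cells[nearest].append(x)
    return cells

reps = [(0.0, 0.0), (5.0, 5.0)]
data = [(0.2, 0.1), (4.8, 5.1), (1.0, 1.0)]
cells = voronoi_partition(data, reps)
print(len(cells[0]), len(cells[1]))  # 2 1
```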

Two REPs are defined to be adjacent when their Voronoi cells meet. The segment weight between two adjacent REPs \(s_i\) and \(s_j\) is defined as follows:

\(w_{ij} = \min\{f(x) : x \in V(s_i) \cap V(s_j)\}\)   (EQ-2-7)

That is, \(w_{ij}\) is the smallest of the function values at the points on the overlapping portion of the boundaries of \(V(s_i)\) and \(V(s_j)\). The point with the smallest function value is called the bridging point. (The bridging point is different from the transition equilibrium vector (TEV) of the conventional method.)
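One inexpensive way to approximate the segment weight (EQ-2-7) is to sample the support function along the straight line between two adjacent REPs and take the minimum; this line crosses the shared Voronoi boundary at its midpoint. This is an illustrative shortcut, not the patent's exact procedure, since the true minimum lies somewhere on the boundary \(V(s_i) \cap V(s_j)\).

```python
def segment_weight(f, rep_i, rep_j, samples=101):
    """Approximate w_ij = min f over the shared Voronoi boundary by sampling
    f along the line segment between the two REPs (illustrative shortcut)."""
    best = float("inf")
    for k in range(samples):
        t = k / (samples - 1)
        x = [a + t * (b - a) for a, b in zip(rep_i, rep_j)]
        best = min(best, f(x))
    return best

# Toy f whose "valley" lies midway between the two REPs (x[0] == 1).
f = lambda x: (x[0] - 1.0) ** 2
w = segment_weight(f, (0.0, 0.0), (2.0, 0.0))
print(abs(w) < 1e-9)  # True: the midpoint x[0] = 1 is sampled exactly
```

A low weight indicates a low "valley" of the support function between the two cells; thresholding such weights is what splits the REP graph into clusters.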

With these terms, a weighted graph \(G = (V, E)\) can be defined:

\(V\): the set of REPs \(\{s_1, \ldots, s_K\}\)

\(E\): the line segments connecting all REP pairs, where each segment between adjacent REPs \(s_k\) and \(s_l\) carries the weight \(w_{kl}\).

The distance between two REPs \(s_i\) and \(s_j\) is then defined over the paths between them in G, consisting of adjacent REP pairs, by the following equation:

\(d(s_i, s_j) = \min_{P:\, s_i \to s_j} \; \max_{(k,l) \in P} w_{kl}\)   (EQ-2-8)

On the other hand, if there is no path consisting of adjacent REP pairs between them, \(d(s_i, s_j) = \infty\).
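The distance (EQ-2-8) is a minimax ("bottleneck") path cost over the REP graph, computable with a small Dijkstra-style search that minimizes the maximum edge weight instead of the sum. The edge weights below are invented for illustration; in the patent they are the segment weights \(w_{kl}\).

```python
import heapq

def minimax_distance(n, edges, src, dst):
    """d(s_i, s_j) of (EQ-2-8): minimize, over all paths, the maximum
    segment weight along the path. edges: (i, j, w); inf if no path."""
    adj = {i: [] for i in range(n)}
    for i, j, w in edges:
        adj[i].append((j, w))
        adj[j].append((i, w))
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            return cost
        if cost > best.get(node, float("inf")):
            continue
        for nxt, w in adj[node]:
            c = max(cost, w)              # path cost = largest edge so far
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(heap, (c, nxt))
    return float("inf")

edges = [(0, 1, 0.3), (1, 2, 0.7), (0, 2, 0.9)]
print(minimax_distance(3, edges, 0, 2))  # 0.7 (via 0-1-2, not the 0.9 edge)
```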

This definition of the distance between REPs is similar to the definition of path distance in the prior art, but with a difference: the prior art uses the TEV, the equilibrium point with the minimum function value on the boundary, whereas the above definition uses the bridging point with the minimum function value on the Voronoi cell boundary.

The geometric distance \(d(s_i, s_j)\) is thus the smallest, over all paths that leave one REP and reach the other through bridging points, of the maximum function value along the path. Given the graph G, the number K of clusters can be adjusted by assigning one cluster to each REP and then hierarchically merging the clusters of the two closest REPs into one cluster, using the same procedure as in agglomerative hierarchical clustering. Because G and the Voronoi cells allow the labeling process for the data to be carried out quickly, and only the information in G (the REPs and the segment weights) needs to be kept, a large amount of storage space is saved.
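The hierarchical merging described above can be sketched as follows. The pairwise REP distances are invented here; in the patent they would come from (EQ-2-8).

```python
def merge_to_k(n_reps, pair_dists, k):
    """Start with one cluster per REP and repeatedly merge the pair of
    clusters whose REPs are closest, as in agglomerative hierarchical
    clustering, until k clusters remain. pair_dists: {(i, j): d}, i < j."""
    labels = list(range(n_reps))
    while len(set(labels)) > k:
        # Closest pair of REPs currently lying in different clusters.
        (i, j), _ = min(
            ((p, d) for p, d in pair_dists.items() if labels[p[0]] != labels[p[1]]),
            key=lambda item: item[1],
        )
        old, new = labels[j], labels[i]
        labels = [new if l == old else l for l in labels]
    # Renumber labels 0..k-1 in order of first appearance.
    remap = {}
    return [remap.setdefault(l, len(remap)) for l in labels]

dists = {(0, 1): 0.2, (1, 2): 0.9, (0, 2): 1.0,
         (2, 3): 0.1, (0, 3): 1.2, (1, 3): 1.1}
print(merge_to_k(4, dists, k=2))  # [0, 0, 1, 1]
```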

Now let G(r) denote the subgraph of G obtained by collecting only those segments e_ij of E, together with the REP pairs they connect, whose weights w_ij are smaller than r:

G(r) = (V, { e_ij in E : w_ij < r }) (EQ-2-9)

When the cell of each REP is its basin cell, the connected components of G(r) generally agree with the connected components of the corresponding level set of the support function.

In general, Voronoi cells do not satisfy this property exactly. In many cases, however, the connected components of G(r) match those of the level set L(r) of the support function. That is, if two REPs v_i and v_j are in the same connected component of G(r), then v_i and v_j belong to the same connected component of the level set L(r), and the converse also holds.

As described above, FIG. 6 compares Voronoi and basin cells for the well-known two-dimensional artificial data sets 2D-N200, Four-Gaussian, Two-circles and Three-circles. The figure shows that the Voronoi cells and the basin cells do not match exactly, because the connection point of two adjacent REPs may not lie exactly on the basin cell boundary.

Nevertheless, the boundaries generated by the Voronoi cells are similar to those generated by the basin cells, and the cluster structure is maintained. The reason is that, in regions where the data near a representative point are concentrated, the data share the same label. This means that, for each REP, the union of the Voronoi cells constituting a specific cluster of the data set has a function value larger than r, as basin cells do.

The algorithm of the present invention is now described for actual implementation. The invention can be divided into three stages: 1) construct a support level function from a sampled data set, based on 'Result 1'; 2) search for the REPs of the support level function so that the clusters can be labeled; 3) label the remaining data using a fast step and a slow step.

<Algorithm: Voronoi cell-based kernel support clustering>

<1) Support level function construction>

1: From the given training data set D, select a sample set S at a specific rate so as to maximize the representativeness of the sample. The remaining set R = D \ S constitutes the remaining data points.

2: From S, compute the support level function f and its optimum value.

<2) Search for the REPs of the support level function and assign a cluster label to each REP>

1: Using the given support level function f, find the set of REPs from S. This is done in one of two ways; the second is mainly used for fast searching.

- Apply any available and efficient optimization technique to f, using each data point as an initial point.

- Divide the sample data into several subsets and apply the dynamic system (EQ-1-3) only to the center point of each subset.

2: For the REPs so obtained, use (EQ-2-7) or (EQ-2-8) to generate the graph G, whose weighted segments connect adjacent REP pairs.

3: Using single-linkage agglomeration, collect only those segments of G with weights smaller than r into the subgraph G(r), and assign the label of the corresponding cluster to each connected component of G(r).
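Step 2)-3 can be sketched as follows: keep only the segments whose weight is below r, then give every connected component of the resulting subgraph its own cluster label. This is a hedged sketch; the variable names are illustrative:

```python
from collections import defaultdict, deque

def label_reps_by_subgraph(n, weighted_edges, r):
    """Build the subgraph G(r) keeping only segments with weight below r,
    then label each connected component with its own cluster id.
    `weighted_edges` is a list of (w_ij, i, j); `n` is the REP count."""
    adj = defaultdict(list)
    for w, i, j in weighted_edges:
        if w < r:                       # segment survives in G(r)
            adj[i].append(j)
            adj[j].append(i)
    labels = [-1] * n
    cur = 0
    for s in range(n):
        if labels[s] != -1:
            continue
        q = deque([s])                  # BFS over one component
        labels[s] = cur
        while q:
            u = q.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cur
                    q.append(v)
        cur += 1                        # next component, next cluster
    return labels
```

With segments (0,1)=0.1, (1,2)=0.5 and (2,3)=0.2 and r = 0.3, the middle segment is dropped and two clusters {0,1} and {2,3} result.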

<3) Labeling>

1: Set the ambiguity ratio threshold ε.

2: for each x in R do

3: Find the REP v1 nearest to x and the second-nearest REP v2.

4: if d(x, v1) / d(x, v2) ≤ ε or v1 and v2 have the same cluster label then assign the cluster label of v1 to x.

5: else apply the dynamic system (EQ-1-3) to find the REP to which x converges, and assign the cluster label of that REP to x.

6: end if

7: end for
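The fast/slow labeling loop of step 3) can be sketched as below. `slow_label` stands in for the dynamic-system descent (EQ-1-3), which is not reproduced here, and `eps` is the ambiguity ratio threshold; all names are illustrative:

```python
import numpy as np

def label_points(points, reps, rep_labels, eps, slow_label=None):
    """Assign each point the label of its nearest REP, falling back to a
    slower routine only in the ambiguous band where d1/d2 exceeds eps and
    the two nearest REPs carry different cluster labels."""
    reps = np.asarray(reps, dtype=float)
    out = []
    for x in np.asarray(points, dtype=float):
        d = np.linalg.norm(reps - x, axis=1)      # distance to every REP
        i1, i2 = np.argsort(d)[:2]                # nearest, second nearest
        ratio = d[i1] / d[i2] if d[i2] > 0 else 1.0
        if ratio <= eps or rep_labels[i1] == rep_labels[i2]:
            out.append(rep_labels[i1])            # fast path
        else:                                     # ambiguous: slow path
            out.append(slow_label(x) if slow_label else rep_labels[i1])
    return out
```

When no `slow_label` is supplied, the sketch simply keeps the nearest-REP label, which corresponds to running only the fast step.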

At this time, the following issues arise in implementation. First, as in step 2)-2, computing the distance between two adjacent REPs v_i and v_j requires a point on the overlapping boundary of the two adjacent Voronoi cells; that is, x must satisfy:

||x - v_i|| = ||x - v_j|| (EQ-2-10)

Let x0 be the midpoint between v_i and v_j, and let d be the projection onto the boundary plane S of the descent direction of the function at x0. Then x0 + t·d is a straight line in the plane S with decreasing direction d from the starting point x0. From this, the minimization problem in (EQ-2-7) can be approximated as:

min over t of f(x0 + t·d) (EQ-2-11)

This is a one-dimensional optimization problem that can be solved easily with the well-known cubic-interpolation line search technique. After step 2), the whole sample space has been divided into Voronoi cells with assigned cluster labels; determining the label of a test point therefore reduces to determining the Voronoi cell to which the test point belongs, which is obtained by finding the REP closest to the test point.
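A minimal sketch of solving the one-dimensional problem (EQ-2-11) follows. The patent uses a cubic-interpolation line search; golden-section search is used here as a simpler stand-in, under the assumption that f is unimodal along the search segment:

```python
import math

def line_search_min(f, x0, d, t_max=1.0, tol=1e-6):
    """Minimize g(t) = f(x0 + t*d) on [0, t_max] by golden-section
    search, a simple stand-in for cubic-interpolation line search."""
    g = lambda t: f([a + t * b for a, b in zip(x0, d)])
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = 0.0, t_max
    c = hi - inv_phi * (hi - lo)      # left interior probe
    e = lo + inv_phi * (hi - lo)      # right interior probe
    gc, ge = g(c), g(e)
    while hi - lo > tol:
        if gc < ge:                   # minimum lies in [lo, e]
            hi, e, ge = e, c, gc
            c = hi - inv_phi * (hi - lo)
            gc = g(c)
        else:                         # minimum lies in [c, hi]
            lo, c, gc = c, e, ge
            e = lo + inv_phi * (hi - lo)
            ge = g(e)
    return (lo + hi) / 2              # approximate minimizer t*
```

For a quadratic such as f([x]) = (x - 0.3)^2 with x0 = [0] and d = [1], the search converges to t ≈ 0.3.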

In step 3), the nearest REP v1 and the second-nearest REP v2 to a data point can be found with an exhaustive search that computes the distances from the point to all REPs. However, to speed up labeling, a more advanced method such as a k-d tree is used instead of this exhaustive search.
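The nearest/second-nearest query of step 3) can be sketched with an exhaustive scan, which costs O(|V|) per point; in practice a k-d tree (for example `scipy.spatial.cKDTree` queried with `k=2`) answers the same query faster once many points must be labeled. Names here are illustrative:

```python
import numpy as np

def two_nearest_reps(reps, x):
    """Return ((i1, d1), (i2, d2)): the index and distance of the
    nearest and second-nearest REP to x, by exhaustive scan."""
    d = np.linalg.norm(np.asarray(reps, float) - np.asarray(x, float), axis=1)
    i1, i2 = np.argsort(d)[:2]        # two smallest distances
    return (int(i1), float(d[i1])), (int(i2), float(d[i2]))
```

Building a k-d tree once over the REPs and querying it per point replaces this scan without changing the result.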

Also, the ambiguity ratio threshold ε determines when the fast rule of step 3)-4 applies; data points lying in the ambiguous region between two adjacent Voronoi cells invoke the slower method of step 3)-5. In this invention, in order to process the clustering more rapidly, ε is set to a value close to 1. This is because, for most of the data used in the experiments, the nearest and second-nearest REPs belong to the same cluster and the ambiguity ratio is much smaller than 1. If a data set contains a significant percentage of data points with ambiguity ratios close to 1, ε can instead be set lower, or to a certain percentage, adding the slower step for those points.

The concrete equations and algorithms of Voronoi cell-based clustering according to an embodiment of the present invention have been described above.

FIGS. 11 to 15 show experimental results verifying the effect of the present invention.

The figures show the results of experiments performed on several well-known benchmark data sets to demonstrate the clustering performance of the present invention; its performance is compared with that of conventional techniques on the same data.

FIGS. 11 and 12 show the labeling time for each test data set. The experimental results confirm the improvement in labeling time achieved by the present invention.

FIGS. 13 to 15 concern images to which conventional methods have been applied for image segmentation. FIG. 14 shows experimental results compared with other clustering methods, and FIG. 15 shows the image information and segmentation time. The experimental results show that the segmentation produced by the present invention divides the images into regions of similar colors and is much faster than the other methods.

As described above, it is experimentally confirmed that the clustering method of the present invention effectively reduces the labeling time while maintaining accuracy on large-scale problems. In particular, it applies well to actual large-scale problems such as image segmentation.

The clustering apparatus 10 according to an exemplary embodiment of the present invention can be applied in various industries that need to cluster and analyze large amounts of data. It can be applied to analyzing large-volume data such as customer data in the marketing analysis of distributors and electronics companies, and to identifying objects in images or videos by image segmentation.

(Example 1) The new clustering method can be applied to image segmentation. Image segmentation identifies objects by recognizing the boundaries between objects in an image and can be used, for example, for image compression. Image data, however, are voluminous; particularly for high-resolution images or video, it is difficult to process them in real time with conventional methods. According to the present invention, image segmentation can be performed in real time even on large-scale image data, which can be applied in research fields such as computer vision.

(Example 2) A financial institution such as a bank uses customer data to decide whether to lend and to analyze customer groups. With conventional methods, whenever new customers are added, all data must be clustered again, and as the data grow, the training time and the cluster-allocation time for new data increase. With the present invention, clusters can be allocated to new data inductively, without repeating the training process, so customer analysis can be performed in real time even on large data sets. In addition, because the present invention can control the number of customer groups, the entire customer base can be divided into a desired number of groups, and each group can be characterized by identifying the traits of the customers belonging to it.

Once again, the present invention provides a Voronoi cell-based clustering technique that speeds up support-based clustering without loss of clustering performance. In contrast to previous studies that optimized only one of the two steps of training and labeling, the present invention optimizes both processes in terms of time and space complexity.

Estimating the kernel support and the labeling process involve a very large number of kernel computations in every support-function call. In terms of computational complexity, SEV labeling itself completes within a reasonable time for a given training sample size, because the number of SEVs is generally very small compared with the total number of data. However, finding the SEV corresponding to each of the N training data requires considerable computation time when N is very large: finding the SEV for one data point amounts to finding a minimum of the support function, and the total cost of finding the SEVs of all N data grows in proportion to N times this per-point cost.

This is why equilibrium-based clustering is difficult to apply to large-volume data problems such as image segmentation. Accordingly, it is an object of the present invention to provide a method applicable to large-scale problems by improving the speed of the labeling step of support-function-based clustering, while maintaining the useful properties described above and without impairing accuracy.

To overcome the computational burden of the two steps, the present invention introduces two main approximation techniques. The first introduces sampling, based on general error bounds, when estimating the support. The second, instead of searching for the REP to which the dynamic system converges for each data point, simply uses a similarity inference that finds the nearest REP from each data point as a first, fast labeling step.

With the first technique, the number of samples sufficient to capture the support of the data distribution is calculated, and the support function is constructed using only these data, improving the cost of constructing the support function. Beyond demonstrating that such samples can represent the support well, this not only allows the sampled data to be used but also adds theoretical guarantees.

Second, at the labeling stage, the labeling time complexity of existing methods is approximately proportional to the number of support vectors constituting the support function (N_sv) times the number of kernel-function evaluations (m), and the computation cost can grow greatly because the gradient of the function, whose complexity is likewise proportional to the number of SVs, must also be evaluated. To solve this problem, the entire data space is divided so that each data point only needs to find its nearest REP. Using this new labeling method, which requires no function evaluations, the labeling speed is improved dramatically without degrading clustering performance.

It will be understood by those skilled in the art that the foregoing description of the present invention is for illustrative purposes only, and that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. The above-described embodiments are therefore illustrative in all aspects and not restrictive. For example, each component described as a single entity may be implemented in distributed form, and components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.

10: Clustering device
100: cluster definition unit
200: cluster allocation unit
300: Data Store

Claims (10)

In a clustering apparatus,
a cluster definition unit that extracts sample data from a data set stored in a data store, defines each cluster from the extracted sample data, and calculates and labels representative points of the clusters; and
a cluster allocation unit that calculates, for each data item of the data set, the distance to the representative point of the closest cluster and the distance to the representative point of the next-closest cluster, and allocates each data item to one of the defined clusters based on the ratio of these distances,
wherein each representative point is calculated for use as a seed of a Voronoi cell.
The apparatus according to claim 1,
wherein the cluster allocation unit allocates the data to the cluster of the closest representative point when the ratio is less than or equal to a threshold value.
The apparatus according to claim 1,
wherein the cluster allocation unit calculates a target representative point using a dynamical system when the ratio is equal to or larger than the threshold, and allocates the data to the cluster of the calculated representative point.
The apparatus according to claim 1,
wherein the cluster definition unit calculates a support function from the sample data and calculates the representative points using the support function.
The apparatus according to claim 1,
wherein the cluster definition unit constructs a weighted graph using the representative points, deletes edges having weights less than a threshold value from the weighted graph to divide it into partial graphs, and defines each cluster such that each divided partial graph is included in a respective cluster.
A clustering method using a clustering apparatus, comprising:
a cluster definition step of extracting sample data from a data set stored in a data store, defining each cluster from the extracted sample data, and calculating and labeling representative points of the clusters; and
a cluster allocation step of calculating, for each data item of the data set, the distance to the representative point of the closest cluster and the distance to the representative point of the next-closest cluster, and allocating each data item of the data set to one of the defined clusters based on the ratio of these distances,
wherein each representative point is calculated for use as a seed of a Voronoi cell.
The method according to claim 6,
wherein the cluster allocation step allocates the data to the cluster of the closest representative point when the ratio is less than or equal to a threshold value.
The method according to claim 6,
wherein the cluster allocation step calculates a target representative point using a dynamical system when the ratio is equal to or larger than the threshold, and allocates the data to the cluster of the calculated representative point.
The method according to claim 6,
wherein the cluster definition step calculates a support function from the sample data and calculates the representative points using the support function.
The method according to claim 6,
wherein the cluster definition step constructs a weighted graph using the representative points, deletes edges having weights less than a threshold value from the weighted graph to divide it into partial graphs, and defines each cluster such that each divided partial graph is included in a respective cluster.
KR1020140031027A 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering KR101577249B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020140031027A KR101577249B1 (en) 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020140031027A KR101577249B1 (en) 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering

Publications (2)

Publication Number Publication Date
KR20150108173A KR20150108173A (en) 2015-09-25
KR101577249B1 true KR101577249B1 (en) 2015-12-14

Family

ID=54246287

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020140031027A KR101577249B1 (en) 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering

Country Status (1)

Country Link
KR (1) KR101577249B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220060375A (en) * 2020-11-04 2022-05-11 서울대학교산학협력단 Method and apparatus for performing fair clustering through estimating fair distribution

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152277B (en) * 2023-03-10 2023-09-22 麦岩智能科技(北京)有限公司 Map segmentation method and device, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ho-Sook Kim and Hwan-Seung Yong, "Design of a weighted graph-based spatial clustering algorithm," Department of Computer Science, Ewha Womans University, 2002.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220060375A (en) * 2020-11-04 2022-05-11 서울대학교산학협력단 Method and apparatus for performing fair clustering through estimating fair distribution
KR102542451B1 (en) * 2020-11-04 2023-06-12 서울대학교산학협력단 Method and apparatus for performing fair clustering through estimating fair distribution

Also Published As

Publication number Publication date
KR20150108173A (en) 2015-09-25

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Fischer et al. Bagging for path-based clustering
US8954365B2 (en) Density estimation and/or manifold learning
CN113221065A (en) Data density estimation and regression method, corresponding device, electronic device, and medium
Kim et al. Improving discrimination ability of convolutional neural networks by hybrid learning
Richards et al. Clustering and unsupervised classification
CN108564083A (en) A kind of method for detecting change of remote sensing image and device
Reddy et al. A Comparative Survey on K-Means and Hierarchical Clustering in E-Commerce Systems
KR101577249B1 (en) Device and method for voronoi cell-based support clustering
Xu et al. MSGCNN: Multi-scale graph convolutional neural network for point cloud segmentation
Zeybek Inlier point preservation in outlier points removed from the ALS point cloud
Liu et al. A new local density and relative distance based spectrum clustering
Ghoshal et al. Estimating uncertainty in deep learning for reporting confidence: An application on cell type prediction in testes based on proteomics
Xu et al. The image segmentation algorithm of colorimetric sensor array based on fuzzy C-means clustering
Zheliznyak et al. Analysis of clustering algorithms
KR100895261B1 (en) Inductive and Hierarchical clustering method using Equilibrium-based support vector
Yu et al. Sparse reconstruction with spatial structures to automatically determine neighbors
Shi et al. Fuzzy support tensor product adaptive image classification for the internet of things
Richards et al. Clustering and Unsupervised Classification
Liu et al. A novel local density hierarchical clustering algorithm based on reverse nearest neighbors
CN112800138A (en) Big data classification method and system
CN113779287A (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
Rao et al. Common object discovery as local search for maximum weight cliques in a global object similarity graph
KR101133804B1 (en) Fast kernel quantile clustering method for large-scale data
Chang et al. Fast marching based superpixels generation

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20181203

Year of fee payment: 4

FPAY Annual fee payment

Payment date: 20191203

Year of fee payment: 5