KR101577249B1 - Device and method for voronoi cell-based support clustering - Google Patents

Device and method for voronoi cell-based support clustering

Info

Publication number
KR101577249B1
KR101577249B1
Authority
KR
South Korea
Prior art keywords
data
cluster
clustering
representative point
representative
Prior art date
Application number
KR1020140031027A
Other languages
Korean (ko)
Other versions
KR20150108173A (en)
Inventor
이재욱
손영두
이수지
김경옥
Original Assignee
서울대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울대학교산학협력단 filed Critical 서울대학교산학협력단
Priority to KR1020140031027A
Publication of KR20150108173A
Application granted
Publication of KR101577249B1

Landscapes

  • Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

A clustering apparatus includes a cluster definition unit that extracts sample data from a data set stored in a data storage and defines clusters from the extracted sample data, and a cluster allocation unit that allocates each data item of the data set to one of the defined clusters. The cluster definition unit computes representative points usable as seeds of Voronoi cells and defines the clusters by labeling the representative points.

Description

TECHNICAL FIELD [0001] The present invention relates to a Voronoi cell-based support clustering apparatus and method.

The present invention relates to a clustering apparatus and method.

Clustering groups similar or related items into a plurality of groups (hereinafter, clusters), as shown in the space of the example of FIG. 5. In other words, clustering assigns objects whose characteristics are similar to the same cluster, while different clusters contain dissimilar data.

Cluster analysis is used to characterize data and is applied in a wide variety of fields such as machine learning, image analysis, information retrieval, pattern recognition, and data analysis. Accordingly, various clustering methods have been developed.

Among them, kernel support clustering, that is, support-based clustering using a kernel, can create complex cluster boundaries compared with other clustering methods and handles outlier data well, so it has been widely studied. The inventor of the present invention has also filed a patent for a basin-cell-based clustering technique using a dynamical system (Application No. 10-2007-00844468).

However, such conventional support-based clustering has the problem that the computational load is too high, especially when processing large amounts of data. Therefore, a clustering apparatus and method that can define complex clusters with high accuracy at a low computational load is needed.

Korean Patent No. 10-1024038 ("Cluster configuration method of cluster sensor network and sensor network to which the method is applied") discloses a configuration for clustering sensor nodes of a sensor network in connection with the present invention.

In addition, US Patent Application Publication No. US-A-2010-0036647 (" Efficient computation of Voronoi diagrams of general generators in general spaces and uses thereof ") discloses a configuration for generating a Voronoi diagram.

SUMMARY OF THE INVENTION It is an object of the present invention to provide a clustering apparatus and method with a low computational load and high accuracy.

According to an aspect of the present invention, there is provided a clustering apparatus including: a cluster definition unit that extracts sample data from a data set stored in a data storage and defines clusters from the extracted sample data; and a cluster allocation unit that allocates each data item of the data set to one of the defined clusters, wherein the cluster definition unit computes representative points usable as seeds of Voronoi cells and defines the clusters by labeling the representative points.

According to a second aspect of the present invention, there is provided a clustering method using a clustering apparatus, comprising: a cluster definition step of extracting sample data from a data set stored in a data storage and defining clusters from the extracted sample data; and a cluster allocation step of allocating each data item of the data set to one of the defined clusters, wherein the cluster definition step computes representative points usable as seeds of Voronoi cells and defines each of the clusters by labeling the representative points.

The present invention obtains the effect that the computation load is low and the accuracy is high in the clustering apparatus and method.

The computational load can be greatly reduced by using Voronoi cells instead of basin-cell clustering, which has a high computational complexity.

In addition, by using basin-cell-based clustering only for the data that require it, it is possible to define complex clusters and maintain the advantage of accurate clustering.

FIG. 1 shows a structure of a clustering apparatus according to an embodiment of the present invention.
FIG. 2 shows a flow of a clustering method according to an embodiment of the present invention.
FIG. 3 illustrates the flow of the cluster definition step according to an embodiment of the present invention.
FIG. 4 illustrates the flow of the cluster allocation step according to an embodiment of the present invention.
FIG. 5 shows an example of clustering.
FIG. 6 shows a comparison of basin cells and Voronoi cells.
FIG. 7 shows the concept of a basin cell.
FIG. 8 shows the concept of a Voronoi cell.
FIG. 9 shows a subgraph.
FIG. 10 conceptually shows an embodiment using a Voronoi cell instead of a basin cell.
FIGS. 11 to 15 show experimental results verifying the effect of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can readily practice the invention. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. In order to clearly illustrate the present invention, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.

Throughout the specification, when a part is referred to as being "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another part in between. Also, when a part is said to "comprise" an element, this means that it may include other elements as well, rather than excluding them, unless specifically stated otherwise.

Hereinafter, the structure of the clustering apparatus according to an embodiment of the present invention and the flow of the clustering method according to an embodiment of the present invention will be briefly described with reference to the structural view of FIG. 1 and the flowcharts of FIGS. 2 to 4. Then, with reference to FIGS. 5 through 10, the concept of Voronoi-cell-based clustering according to an embodiment of the present invention will be described in detail, followed by the specific mathematical expressions and algorithms.

FIG. 1 shows a structure of a clustering apparatus according to an embodiment of the present invention.

The clustering apparatus 10 according to an embodiment of the present invention includes a cluster defining unit 100, a cluster allocating unit 200, and a data storage 300.

The clustering apparatus 10 according to an embodiment of the present invention uses a kernel support clustering method based on Voronoi cells. The kernel support clustering method is divided into two stages. The first stage creates a support function representing the boundaries of the clusters, and the second stage is a labeling step that assigns each data item to a cluster through the support function.

The two steps are performed by the cluster defining unit 100 and the cluster allocating unit 200 of the clustering apparatus 10 according to an embodiment of the present invention. That is, the cluster defining unit 100 creates support functions to define each cluster, and the cluster allocating unit 200 allocates (labels) the data to one of the defined clusters.

The data to be clustered and the clustering result may be stored in the data storage 300. For example, the cluster defining unit 100 can define clusters using sample data extracted from the data stored in the data storage 300, the cluster allocating unit 200 can allocate the data stored in the data storage 300 to the defined clusters, and the allocation results may also be stored in the data storage 300.

Defining clusters with basin cells and labeling the data can classify data accurately into complex clusters, but the computational load is high. Therefore, the cluster defining unit 100 and the cluster allocating unit 200 according to an embodiment of the present invention define approximate cluster boundaries using Voronoi cells, label most data with simple operations against these Voronoi-cell boundaries, and fall back to the labeling operation of basin-cell-based clustering (e.g., running the dynamical system) only for data that cannot be labeled this way.

FIG. 2 shows a flow of a clustering method according to an embodiment of the present invention.

First, clusters are defined by dividing the sample space (S100). The cluster defining unit 100 extracts sample data from the data set stored in the data storage 300, creates a support function indicating the cluster boundaries, and divides the space of the sample data into a plurality of clusters.

Next, data is allocated to each cluster (S200). That is, the data stored in the data repository 300 is labeled through the support function.

FIGS. 3 to 4 illustrate the flow of each step of the clustering method according to an embodiment of the present invention. As described above, the flow of each step will be briefly described and the details will be described later.

Figure 3 illustrates the flow of clustering definition steps in accordance with an embodiment of the present invention.

In the cluster definition (S100), first, a data sample is extracted (S110).

Next, the support function and the optimum value are calculated (S120). The support function is conceptually defined as a contour line when the data is expressed in the feature space, as will be described later.

Next, a dynamic system is constructed using the support function and a representative point is calculated (S130). The representative point is used as a seed point of the Voronoi cell, which will be described later.

Next, a weighted graph is constructed using representative points (S140).

Next, the weighted graph is converted into subgraphs (S150). Edges whose weights are below the threshold value are removed, dividing the weighted graph into a plurality of subgraphs.

Next, the cluster definition is completed by labeling each representative point using the computed subgraphs (S160).
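Steps S140 to S160 above can be sketched as follows: build a weighted graph over the representative points, cut the low-weight edges, and label each representative point by the subgraph (connected component) it falls into. This is a minimal illustration; the representative points, weights, and threshold here are invented, whereas in the patent the weights come from the support function.

```python
# Sketch of S140-S160: weighted graph over representative points (REPs),
# low-weight edges removed (S150), REPs labeled by connected component (S160).
# All names and numeric values below are illustrative, not from the patent.

def label_representatives(n_reps, weighted_edges, threshold):
    """weighted_edges: list of (i, j, w). Edges with w < threshold are cut."""
    # Build the adjacency of the subgraph after cutting low-weight edges.
    adj = {i: [] for i in range(n_reps)}
    for i, j, w in weighted_edges:
        if w >= threshold:
            adj[i].append(j)
            adj[j].append(i)
    # Label each REP by the connected component it belongs to.
    labels = [-1] * n_reps
    cluster = 0
    for start in range(n_reps):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = cluster
                stack.extend(adj[node])
        cluster += 1
    return labels

# Four REPs; the weak edge (1, 2) is cut, splitting the graph in two.
edges = [(0, 1, 0.9), (1, 2, 0.1), (2, 3, 0.8)]
print(label_representatives(4, edges, threshold=0.5))  # [0, 0, 1, 1]
```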

FIG. 4 illustrates a flow of the clustering allocation step according to an embodiment of the present invention.

The cluster allocation (S200) repeats the following steps for each data item until labeling is complete (S260).

First, the nearest and second-nearest representative points to the data are computed (S220).

Next, it is determined whether the ratio of the distances from the data point to the two representative points is above a certain level, or whether the two representative points belong to the same cluster (S230). As described above, this step determines whether labeling will be performed using the Voronoi cell or using the dynamical system of the basin-cell-based clustering technique.

When the Voronoi cell can be used, the data is allocated to the cluster of the closest representative point (S240).

If the Voronoi cell cannot be used, the representative point is computed using the dynamical system, and the data is assigned to that representative point's cluster (S250).
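The allocation flow S220 to S250 might be sketched as below. The ratio threshold, its direction, and the fallback routine are assumptions for illustration: in the patent the fallback runs the dynamical system of the basin-cell method, which is stood in for here by a caller-supplied `slow_labeler`.

```python
import math

def assign_cluster(point, reps, rep_labels, ratio_threshold, slow_labeler):
    """Sketch of S220-S250. reps: REP coordinates; rep_labels: their cluster
    labels; slow_labeler: fallback used when the Voronoi shortcut is unsafe
    (a placeholder for the patent's dynamical-system labeling)."""
    dist = lambda p, q: math.dist(p, q)
    order = sorted(range(len(reps)), key=lambda i: dist(point, reps[i]))
    nearest, second = order[0], order[1]                      # S220
    d1, d2 = dist(point, reps[nearest]), dist(point, reps[second])
    # S230: the shortcut is safe if both REPs share a cluster anyway, or the
    # nearest REP is unambiguous (assumed test: d2/d1 above a threshold).
    if rep_labels[nearest] == rep_labels[second] or d2 / max(d1, 1e-12) >= ratio_threshold:
        return rep_labels[nearest]                            # S240: fast path
    return slow_labeler(point)                                # S250: fallback

reps = [(0.0, 0.0), (4.0, 0.0)]
labels = [0, 1]
# A point clearly closest to the first REP takes the Voronoi fast path.
print(assign_cluster((0.5, 0.2), reps, labels, 2.0, slow_labeler=lambda p: -1))  # 0
```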

Referring to FIGS. 5 through 10, the concept of Voronoi-cell-based clustering according to an embodiment of the present invention will now be described in detail.

Figure 5 shows an embodiment of clustering.

FIG. 5-(a) shows the data stored in the data storage 300 plotted in a space, and FIG. 5-(b) shows the result of classifying the data into a plurality of clusters. As shown, the data are classified into clusters according to their distances in the space.

Therefore, clustering can be seen as dividing the space into multiple clusters. The inventor of the present invention has already applied for a patent that defines cluster boundaries using the concept of the basin cell (Application No. 10-2007-00844468). However, since that approach has the problem of a high computational load as described above, the present invention simplifies it using Voronoi cells.

FIG. 6 shows a comparison between basin cells and Voronoi cells, FIG. 7 shows the concept of a basin cell, and FIG. 8 shows the concept of a Voronoi cell.

Referring to FIG. 7, the concept of a basin cell is as follows. A stable point at which the gradient of a function becomes zero is called a stable equilibrium vector (SEV) or stable equilibrium point (SEP; hereinafter, equilibrium point or SEP). As can be seen, the basin cell is the region that converges to the equilibrium point. Therefore, each equilibrium point can represent its basin cell.

Referring to FIG. 8, Voronoi tessellation is a method of dividing a space using a set of points called seeds. The region of points closest to a given seed is called a Voronoi cell, and the diagram obtained by dividing the space into Voronoi cells as shown in FIG. 8 is called a Voronoi diagram. Since the points included in a Voronoi cell are closer to that cell's seed than to the seeds of the other cells, the Voronoi cell technique is well suited for clustering.

In this specification, the seed of each Voronoi cell is defined as a representative equilibrium point (REP). This point is also an equilibrium point in the sense of FIG. 7. That is, the clustering apparatus 10 according to an embodiment of the present invention uses the equilibrium point representing each basin cell as the representative point (seed) of each Voronoi cell.

FIG. 6 shows basin cells (curves) and Voronoi cells (straight lines) for well-known two-dimensional artificial data sets: (a) 2D-N200, (b) Four-Gaussian, (c) Two-circles, and (d) Three-circles. As shown, clusters can be defined such that the Voronoi cell boundaries lie very close to the basin cell boundaries.

However, the Voronoi cells and basin cells do not coincide exactly, because the bridging point of two adjacent representative points may not lie exactly on the basin cell boundary. Therefore, as described above, the cluster allocation step S200 according to an embodiment of the present invention includes a step S230 of determining whether data can be labeled using the Voronoi cell. The detailed concept is described with reference to FIG. 10.

FIG. 9 shows a subgraph.

As shown in the figure, a weighted graph is constructed by assigning weights to the edges between representative points; after removing the edges whose weights are below the reference value (e.g., edge S3), the graph is divided into a plurality of subgraphs. Each subgraph constitutes one cluster. The concepts of the weighted graph and the subgraph are discussed in the inventor's earlier application on basin-cell clustering mentioned above, so a detailed description is omitted here.

FIG. 10 conceptually shows an embodiment using a Voronoi cell instead of a basin cell.

The figure shows a Voronoi cell corresponding to one basin cell (sharing the same representative point R), in order to describe the contents of FIG. 6 more specifically. Since the basin cell C1 is bounded by a curve as shown in the drawing, the data can be labeled accurately even when the shape of the cluster is complicated. On the other hand, since the Voronoi cell C2 simplifies the boundary to straight lines, the operation of defining a cluster and the operation of assigning data to the cluster are simplified.

In (c), since the point P1 lies in both the Voronoi cell and the basin cell, it is correctly labeled as belonging to the cluster even when the Voronoi cell is used instead of the basin cell. However, the point P2 lies outside the Voronoi cell but inside the basin cell; if the Voronoi cell were used instead of the basin cell, P2 would not be labeled as belonging to the cluster. Therefore, for such points the basin-cell method must be used.

Therefore, after computing the representative point R nearest to the data point P2 (S220), it is determined whether the ratio of the distances from the data point to the two nearest representative points is above a certain level (S230).

The basic concepts underlying Voronoi-cell-based clustering according to an embodiment of the present invention have been described above. Now, the specific formulas and algorithms of Voronoi-cell-based clustering according to an embodiment of the present invention will be described in detail.

As described above, the kernel support clustering method is roughly divided into two steps. The first step creates a support function indicating the boundaries of the clusters, and the second step is a labeling step that assigns each data item to a cluster through the support function.

1) Support function

A support function \(f: \mathbb{R}^n \to \mathbb{R}_+\) maps n-dimensional data to positive real numbers and measures the support of the data distribution. Its level set is divided into m connected subsets as follows:

\(L_r = \{x : f(x) \le r\} = C_1 \cup \cdots \cup C_m\)   (EQ-1-1)

Here \(C_1, \ldots, C_m\) are the connected subsets, whose number m is determined as the number of clusters. As described above, if FIG. 5-(a) is the original data, FIG. 5-(b) shows the result divided into m clusters.

The support function can be generated by the SVDD (Support Vector Domain Description) method. The SVDD method maps data points into a high-dimensional feature space and finds the minimum-radius sphere containing most of the data in this feature space. When the sphere thus found is mapped back into the data space, it splits into m closed sets representing the clusters. With a kernel \(K\), the kernel support function trained by the SVDD method is

\(f(x) = K(x,x) - 2\sum_i \beta_i K(x_i, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)\)   (EQ-1-2)

where the \(x_i\) are the support vectors and the \(\beta_i\) their coefficients. The method used in the present invention is not limited to SVDD and can be applied with any method that learns a support function empirically from data.
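A direct transcription of the support function form (EQ-1-2) with a Gaussian kernel might look as follows. The support vectors and coefficients below are made-up numbers, not a trained SVDD model; the sketch only shows how the function is evaluated.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def support_function(x, svs, betas, sigma=1.0):
    """Support function of the SVDD form (EQ-1-2):
    f(x) = K(x,x) - 2 sum_i b_i K(x_i,x) + sum_ij b_i b_j K(x_i,x_j)."""
    k_xx = gaussian_kernel(x, x, sigma)          # equals 1 for a Gaussian kernel
    cross = sum(b * gaussian_kernel(sv, x, sigma) for sv, b in zip(svs, betas))
    const = sum(bi * bj * gaussian_kernel(si, sj, sigma)
                for si, bi in zip(svs, betas) for sj, bj in zip(svs, betas))
    return k_xx - 2 * cross + const

# Toy, untrained example: two "support vectors" with equal coefficients.
svs, betas = [(0.0, 0.0), (2.0, 0.0)], [0.5, 0.5]
inside = support_function((1.0, 0.0), svs, betas)
far = support_function((10.0, 0.0), svs, betas)
print(inside < far)  # True: f is small near the data, large far away
```

Points inside the support region get small function values, which is exactly what the level set \(L_r\) of (EQ-1-1) exploits.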

2) After obtaining the support function comes the labeling step. There are several methods for support clustering labeling. Proximity-graph-based methods include the Delaunay diagram (DD), minimum spanning tree (MST), and K-nearest neighbor (KNN). However, MST and KNN are likely to miss important contours and thus to produce clusters different from the actual boundaries. To solve this problem, an equilibrium-vector-based clustering (EVC) method using topological features of the trained support function has been proposed.

The EVC consists of two steps. The first step uses the support function f to build the associated dynamical system:

\(\frac{dx}{dt} = -E(x)\,\nabla f(x)\)   (EQ-1-3)

where \(E(x)\) is a positive definite symmetric matrix for all data \(x\). For the dynamical system (EQ-1-3), a unique solution \(x(t)\) with respect to the time t and the starting point x exists when f is twice differentiable and the norm of its gradient is bounded. A state vector \(s\) satisfying \(\nabla f(s) = 0\) is an equilibrium vector of the system (EQ-1-3). If the Jacobian matrix of the system at \(s\) has no eigenvalue with zero real part, \(s\) is called hyperbolic. A hyperbolic equilibrium vector \(s\) is (i) a stable equilibrium vector (SEV) if all eigenvalues of the Hessian of f at s are positive, and (ii) an unstable equilibrium vector (UEV) otherwise. An SEV is a local minimum of the support function f.
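Since an SEV is a local minimum of f, following the dynamical system (EQ-1-3) with \(E(x)\) taken as the identity amounts to gradient descent on f. The sketch below uses explicit Euler steps and a central-difference gradient; the test function (a quadratic whose only SEV is the origin) is an invented stand-in for a trained support function.

```python
def find_sev(f, x0, step=0.1, iters=500, eps=1e-6):
    """Follow dx/dt = -grad f(x) by explicit Euler steps; the trajectory
    converges to the SEV (local minimum) whose basin contains x0.
    The gradient is approximated by central differences."""
    x = list(x0)
    for _ in range(iters):
        grad = []
        for d in range(len(x)):
            xp, xm = list(x), list(x)
            xp[d] += eps
            xm[d] -= eps
            grad.append((f(xp) - f(xm)) / (2 * eps))
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x

# Toy function with a single SEV at the origin (not a trained support function).
f = lambda x: x[0] ** 2 + x[1] ** 2
sev = find_sev(f, (1.0, -0.7))
print(all(abs(c) < 1e-3 for c in sev))  # True
```

In the patent's setting, starting this process from each sampled data point identifies the SEVs (the representative points), since every point converges to the SEV of its basin.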

An important concept in EVC for inductive learning is the basin cell associated with (EQ-1-3). The basin of attraction of an SEV s is the set of all points that converge to s as the dynamical process proceeds:

\(A(s) = \{x : \lim_{t\to\infty} x(t) = s\}\)

The basin cell \(\bar{A}(s)\) of an SEV s is defined as the closure of its basin of attraction:

\(\bar{A}(s) = \mathrm{cl}(A(s))\)

The boundary of the basin cell, denoted \(\partial \bar{A}(s)\), is called the basin cell boundary. One of the good features of the basin cell is that, under certain conditions, the entire data space is partitioned into the basin cells of several SEVs, as in the following equation:

\(\mathbb{R}^n = \bigcup_{s_i \in S} \bar{A}(s_i)\)   (EQ-1-4)

where S is the set of SEVs of the dynamical system (EQ-1-3). Therefore, the entire data space can be divided into basin cells. Since all data points converge to a specific SEV under the dynamical process described above, the basin cells can be identified by finding the SEVs. Then, in the second step, labels are assigned using the adjacency matrix of the SEVs, or the entire data space is labeled using transition equilibrium vectors (TEVs).

This method also extends each cluster \(C_i\) of (EQ-1-1) to an expanded cluster as follows:

\(C_i^{ext} = \bigcup_{s_k \in C_i} \bar{A}(s_k)\)   (EQ-1-5)

where the union is over the SEVs \(s_k\) belonging to cluster \(C_i\). As a result, the entire data space can be divided into the expanded clusters, as in the following equation:

\(\mathbb{R}^n = \bigcup_{i=1}^{m} C_i^{ext}\)   (EQ-1-6)

Thus, bounded support vectors located outside the cluster boundaries, as well as data not used for training, can be given inductive cluster labels.

The SMO algorithm, widely used for solving the quadratic programming problem in support-based clustering, has a time complexity that grows with the number of data points N. The time complexity of cluster labeling, on the other hand, grows with the number of support vectors constituting the support function and with m, the number of evaluations of each kernel function.

To speed up cluster labeling and reduce the cost of support estimation, we use two main approximation methods for fast computation. 1) First, the support estimate of the entire data is approximated by the support estimate obtained from a small data sample. 2) Second, the cluster boundaries partitioned using basin cells are approximated by Voronoi cells. The Voronoi cell assigns each data point to the nearest representative point without an explicit call to the kernel support function. The following sections describe the details of this method.

<Sampling and support level function construction>

The support level function is a positive scalar function \(f: \mathbb{R}^n \to \mathbb{R}_+\). Its level set \(L_r = \{x : f(x) \le r\}\) estimates the support domain of the data distribution (the domain surrounding most of the data points). This is divided into m connected subsets as follows:

\(L_r = C_1 \cup \cdots \cup C_m\)   (EQ-2-1)

In general, support level functions can be obtained by the SVDD (Support Vector Domain Description) method, Gaussian process clustering, or other kernel-based distribution function estimation. In the present invention, SVDD is used to estimate the support function; nevertheless, the method can easily be applied to any support function. Given a data set \(\{x_1, \ldots, x_N\}\), the SVDD method maps the data points into a high-dimensional feature space and finds the minimum-radius sphere containing most of the data in this feature space. When mapped back into the data space, this sphere splits into m closed sets representing the respective clusters. With a Gaussian kernel \(K\), the kernel support function trained by the SVDD method is

\(f(x) = K(x,x) - 2\sum_i \beta_i K(x_i, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)\)   (EQ-2-2)

where the \(x_i\) are the support vectors and the \(\beta_i\) their coefficients.

In many real-world large data sets, most of the data is concentrated in a relatively narrow region, and a very small fraction of the data can describe the distribution of the entire data set, particularly its support, quite well. We now justify the use of sampled data points and explain how the trained Gaussian kernel support function represents the support of the class-conditional distribution. For this, a generalization error bound is applied.

Result 1: Let \(\{x_1, \ldots, x_N\}\) be iid (independently and identically distributed) samples from a probability distribution P containing no discrete components. For the trained Gaussian kernel support function given by (EQ-2-2), let \(L_r\) denote the level set derived from the level value r. Then, with high probability over the samples, the probability mass falling outside \(L_r\) is bounded as in (EQ-2-3); and when the number of support vectors of the function is small relative to the sample size, the simpler bound (EQ-2-4) follows.

This result indicates that when the right-hand side of (EQ-2-4) is sufficiently small for the purpose of the present invention, which is the case for large data sets where the sample size n satisfies n << N, sample data of size n can sufficiently capture the support of the data distribution.

<Sample space division>

After a support level function has been created from a small sampled fraction of the entire large data set, the next step is to assign cluster labels to the sampled data and to the remaining data (or unseen test data). This process can be done with the inductive basin-cell-based labeling method, because it divides the entire data space into basin cells and thereby naturally extends the clusters it produces.

One important drawback of this labeling process, however, is that it can be very costly, because it evaluates gradients of the trained support level function, whose complexity grows with the number of support vectors. To overcome this drawback and to simplify the labeling process, the present invention uses a new labeling method that does not require evaluating the function.

A state vector \(s\) of the function \(f\) satisfying \(\nabla f(s) = 0\) is defined as a representative equilibrium point (REP); such a point is also a stable equilibrium point (SEP). The core idea of the present invention is to determine the cluster of each data point by assigning it the cluster label of its nearest REP. Define the Voronoi cell of an REP s as the closure of the set of all points nearest to s:

\(V(s) = \mathrm{cl}\{x : \|x - s\| \le \|x - s'\| \text{ for every other REP } s'\}\)   (EQ-2-5)

All of the data in \(V(s)\) is assigned to the same cluster as s. The method then separates the entire data space into the Voronoi cells of the REPs as follows (this is analogous to the way the entire data space is separated into basin cells in the existing basin-cell-based method):

\(\mathbb{R}^n = \bigcup_{s_i} V(s_i)\)   (EQ-2-6)
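The partition (EQ-2-6) amounts to a plain nearest-REP assignment, which needs no call to the support function. A small sketch with invented coordinates:

```python
import math

def voronoi_partition(data, reps):
    """Assign every point to the Voronoi cell (EQ-2-5) of its nearest REP,
    realizing the partition of the data space in (EQ-2-6)."""
    cells = {i: [] for i in range(len(reps))}
    for x in data:
        nearest = min(range(len(reps)), key=lambda i: math.dist(x, reps[i]))
        cells[nearest].append(x)
    return cells

reps = [(0.0, 0.0), (5.0, 5.0)]
data = [(0.2, 0.1), (4.8, 5.1), (1.0, 1.0)]
cells = voronoi_partition(data, reps)
print(len(cells[0]), len(cells[1]))  # 2 1
```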

Two REPs are defined to be adjacent when their Voronoi cells meet. The segment weight between two adjacent REPs \(s_i\) and \(s_j\) is defined as follows:

\(w_{ij} = \min\{f(x) : x \in V(s_i) \cap V(s_j)\}\)   (EQ-2-7)

That is, \(w_{ij}\) is the smallest of the function values at the points on the overlapping portion of the boundaries of \(V(s_i)\) and \(V(s_j)\). The point with the smallest function value is called the bridging point. (The bridging point is different from the transition equilibrium vector (TEV) of the conventional method.)
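One inexpensive way to approximate the segment weight (EQ-2-7) is to sample the support function along the straight line between two adjacent REPs and take the minimum; this line crosses the shared Voronoi boundary at its midpoint. This is an illustrative shortcut, not the patent's exact procedure, since the true minimum lies somewhere on the boundary \(V(s_i) \cap V(s_j)\).

```python
def segment_weight(f, rep_i, rep_j, samples=101):
    """Approximate w_ij = min f over the shared Voronoi boundary by sampling
    f along the line segment between the two REPs (illustrative shortcut)."""
    best = float("inf")
    for k in range(samples):
        t = k / (samples - 1)
        x = [a + t * (b - a) for a, b in zip(rep_i, rep_j)]
        best = min(best, f(x))
    return best

# Toy f whose "valley" lies midway between the two REPs (x[0] == 1).
f = lambda x: (x[0] - 1.0) ** 2
w = segment_weight(f, (0.0, 0.0), (2.0, 0.0))
print(abs(w) < 1e-9)  # True: the midpoint x[0] = 1 is sampled exactly
```

A low weight indicates a low "valley" of the support function between the two cells; thresholding such weights is what splits the REP graph into clusters.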

With these terms, a weighted graph \(G = (V, E)\) can be defined:

\(V\): the set of REPs \(\{s_1, \ldots, s_K\}\)

\(E\): the line segments connecting all REP pairs, where each segment between adjacent REPs \(s_k\) and \(s_l\) carries the weight \(w_{kl}\).

The distance between two REPs \(s_i\) and \(s_j\) is then defined over the paths between them in G, consisting of adjacent REP pairs, by the following equation:

\(d(s_i, s_j) = \min_{P:\, s_i \to s_j} \; \max_{(k,l) \in P} w_{kl}\)   (EQ-2-8)

On the other hand, if there is no path consisting of adjacent REP pairs between them, \(d(s_i, s_j) = \infty\).
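The distance (EQ-2-8) is a minimax ("bottleneck") path cost over the REP graph, computable with a small Dijkstra-style search that minimizes the maximum edge weight instead of the sum. The edge weights below are invented for illustration; in the patent they are the segment weights \(w_{kl}\).

```python
import heapq

def minimax_distance(n, edges, src, dst):
    """d(s_i, s_j) of (EQ-2-8): minimize, over all paths, the maximum
    segment weight along the path. edges: (i, j, w); inf if no path."""
    adj = {i: [] for i in range(n)}
    for i, j, w in edges:
        adj[i].append((j, w))
        adj[j].append((i, w))
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            return cost
        if cost > best.get(node, float("inf")):
            continue
        for nxt, w in adj[node]:
            c = max(cost, w)              # path cost = largest edge so far
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(heap, (c, nxt))
    return float("inf")

edges = [(0, 1, 0.3), (1, 2, 0.7), (0, 2, 0.9)]
print(minimax_distance(3, edges, 0, 2))  # 0.7 (via 0-1-2, not the 0.9 edge)
```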

This definition of the distance between REPs is similar to the definition of path distance in the prior art, but with a difference: the prior art uses the TEV, the equilibrium point with the minimum function value on the boundary, whereas the above definition uses the bridging point with the minimum function value on the Voronoi cell boundary.

The geometric distance \(d(s_i, s_j)\) is thus the smallest, over all paths that leave one REP and reach the other through bridging points, of the maximum function value along the path. Given the graph G, the number K of clusters can be adjusted by assigning one cluster to each REP and then hierarchically merging the clusters of the two closest REPs into one cluster, using the same procedure as in agglomerative hierarchical clustering. Because G and the Voronoi cells allow the labeling process for the data to be carried out quickly, and only the information in G (the REPs and the segment weights) needs to be kept, a large amount of storage space is saved.
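The hierarchical merging described above can be sketched as follows. The pairwise REP distances are invented here; in the patent they would come from (EQ-2-8).

```python
def merge_to_k(n_reps, pair_dists, k):
    """Start with one cluster per REP and repeatedly merge the pair of
    clusters whose REPs are closest, as in agglomerative hierarchical
    clustering, until k clusters remain. pair_dists: {(i, j): d}, i < j."""
    labels = list(range(n_reps))
    while len(set(labels)) > k:
        # Closest pair of REPs currently lying in different clusters.
        (i, j), _ = min(
            ((p, d) for p, d in pair_dists.items() if labels[p[0]] != labels[p[1]]),
            key=lambda item: item[1],
        )
        old, new = labels[j], labels[i]
        labels = [new if l == old else l for l in labels]
    # Renumber labels 0..k-1 in order of first appearance.
    remap = {}
    return [remap.setdefault(l, len(remap)) for l in labels]

dists = {(0, 1): 0.2, (1, 2): 0.9, (0, 2): 1.0,
         (2, 3): 0.1, (0, 3): 1.2, (1, 3): 1.1}
print(merge_to_k(4, dists, k=2))  # [0, 0, 1, 1]
```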

Now let G(r) denote the subgraph of G obtained by collecting only those segments e_ij of E, together with the REP pairs they connect, whose weights w_ij are smaller than r:

G(r) = (V, { e_ij in E : w_ij < r }) (EQ-2-9)

When the cell of each REP is its basin cell, the connected components of G(r) generally agree with the connected components of the corresponding level set of the support function.

In general, Voronoi cells do not satisfy this property exactly. In many cases, however, the connected components of G(r) match those of the level set L(r) of the support function. That is, if two REPs v_i and v_j are in the same connected component of G(r), then v_i and v_j belong to the same connected component of the level set L(r), and the converse also holds.

As described above, FIG. 6 compares Voronoi and basin cells for the well-known two-dimensional artificial data sets 2D-N200, Four-Gaussian, Two-circles and Three-circles. The figure shows that the Voronoi cells and the basin cells do not match exactly, because the connection point of two adjacent REPs may not lie exactly on the basin cell boundary.

Nevertheless, the boundaries generated by the Voronoi cells are similar to those generated by the basin cells, and the cluster structure is maintained. The reason is that, in regions where the data near a representative point are concentrated, the data share the same label. This means that, for each REP, the union of the Voronoi cells constituting a specific cluster of the data set has a function value larger than r, as basin cells do.

The algorithm of the present invention is now described for actual implementation. The invention can be divided into three stages: 1) construct a support level function from a sampled data set, based on 'Result 1'; 2) search for the REPs of the support level function so that the clusters can be labeled; 3) label the remaining data using a fast step and a slow step.

<Algorithm: Voronoi cell-based kernel support clustering>

<1) Support level function construction>

1: From the given training data set D, select a sample set S at a specific rate so as to maximize the representativeness of the sample. The remaining set R = D \ S constitutes the remaining data points.

2: From S, compute the support level function f and its optimum value.

<2) Search for the REPs of the support level function and assign a cluster label to each REP>

1: Using the given support level function f, find the set of REPs from S. This is done in one of two ways; the second is mainly used for fast searching.

- Apply any available and efficient optimization technique to f, using each data point as an initial point.

- Divide the sample data into several subsets and apply the dynamic system (EQ-1-3) only to the center point of each subset.

2: For the REPs so obtained, use (EQ-2-7) or (EQ-2-8) to generate the graph G, whose weighted segments connect adjacent REP pairs.

3: Using single-linkage agglomeration, collect only those segments of G with weights smaller than r into the subgraph G(r), and assign the label of the corresponding cluster to each connected component of G(r).
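Step 2)-3 can be sketched as follows: keep only the segments whose weight is below r, then give every connected component of the resulting subgraph its own cluster label. This is a hedged sketch; the variable names are illustrative:

```python
from collections import defaultdict, deque

def label_reps_by_subgraph(n, weighted_edges, r):
    """Build the subgraph G(r) keeping only segments with weight below r,
    then label each connected component with its own cluster id.
    `weighted_edges` is a list of (w_ij, i, j); `n` is the REP count."""
    adj = defaultdict(list)
    for w, i, j in weighted_edges:
        if w < r:                       # segment survives in G(r)
            adj[i].append(j)
            adj[j].append(i)
    labels = [-1] * n
    cur = 0
    for s in range(n):
        if labels[s] != -1:
            continue
        q = deque([s])                  # BFS over one component
        labels[s] = cur
        while q:
            u = q.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cur
                    q.append(v)
        cur += 1                        # next component, next cluster
    return labels
```

With segments (0,1)=0.1, (1,2)=0.5 and (2,3)=0.2 and r = 0.3, the middle segment is dropped and two clusters {0,1} and {2,3} result.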

<3) Labeling>

1: Set the ambiguity ratio threshold ε.

2: for each x in R do

3: Find the REP v1 nearest to x and the second-nearest REP v2.

4: if d(x, v1) / d(x, v2) ≤ ε or v1 and v2 have the same cluster label then assign the cluster label of v1 to x.

5: else apply the dynamic system (EQ-1-3) to find the REP to which x converges, and assign the cluster label of that REP to x.

6: end if

7: end for
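The fast/slow labeling loop of step 3) can be sketched as below. `slow_label` stands in for the dynamic-system descent (EQ-1-3), which is not reproduced here, and `eps` is the ambiguity ratio threshold; all names are illustrative:

```python
import numpy as np

def label_points(points, reps, rep_labels, eps, slow_label=None):
    """Assign each point the label of its nearest REP, falling back to a
    slower routine only in the ambiguous band where d1/d2 exceeds eps and
    the two nearest REPs carry different cluster labels."""
    reps = np.asarray(reps, dtype=float)
    out = []
    for x in np.asarray(points, dtype=float):
        d = np.linalg.norm(reps - x, axis=1)      # distance to every REP
        i1, i2 = np.argsort(d)[:2]                # nearest, second nearest
        ratio = d[i1] / d[i2] if d[i2] > 0 else 1.0
        if ratio <= eps or rep_labels[i1] == rep_labels[i2]:
            out.append(rep_labels[i1])            # fast path
        else:                                     # ambiguous: slow path
            out.append(slow_label(x) if slow_label else rep_labels[i1])
    return out
```

When no `slow_label` is supplied, the sketch simply keeps the nearest-REP label, which corresponds to running only the fast step.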

At this time, the following issues arise in implementation. First, as in step 2)-2, computing the distance between two adjacent REPs v_i and v_j requires a point on the overlapping boundary of the two adjacent Voronoi cells; that is, x must satisfy:

||x - v_i|| = ||x - v_j|| (EQ-2-10)

Let x0 be the midpoint between v_i and v_j, and let d be the projection onto the boundary plane S of the descent direction of the function at x0. Then x0 + t·d is a straight line in the plane S with decreasing direction d from the starting point x0. From this, the minimization problem in (EQ-2-7) can be approximated as:

min over t of f(x0 + t·d) (EQ-2-11)

This is a one-dimensional optimization problem that can be solved easily with the well-known cubic-interpolation line search technique. After step 2), the whole sample space has been divided into Voronoi cells with assigned cluster labels; determining the label of a test point therefore reduces to determining the Voronoi cell to which the test point belongs, which is obtained by finding the REP closest to the test point.
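A minimal sketch of solving the one-dimensional problem (EQ-2-11) follows. The patent uses a cubic-interpolation line search; golden-section search is used here as a simpler stand-in, under the assumption that f is unimodal along the search segment:

```python
import math

def line_search_min(f, x0, d, t_max=1.0, tol=1e-6):
    """Minimize g(t) = f(x0 + t*d) on [0, t_max] by golden-section
    search, a simple stand-in for cubic-interpolation line search."""
    g = lambda t: f([a + t * b for a, b in zip(x0, d)])
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = 0.0, t_max
    c = hi - inv_phi * (hi - lo)      # left interior probe
    e = lo + inv_phi * (hi - lo)      # right interior probe
    gc, ge = g(c), g(e)
    while hi - lo > tol:
        if gc < ge:                   # minimum lies in [lo, e]
            hi, e, ge = e, c, gc
            c = hi - inv_phi * (hi - lo)
            gc = g(c)
        else:                         # minimum lies in [c, hi]
            lo, c, gc = c, e, ge
            e = lo + inv_phi * (hi - lo)
            ge = g(e)
    return (lo + hi) / 2              # approximate minimizer t*
```

For a quadratic such as f([x]) = (x - 0.3)^2 with x0 = [0] and d = [1], the search converges to t ≈ 0.3.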

In step 3), the nearest REP v1 and the second-nearest REP v2 to a data point can be found with an exhaustive search that computes the distances from the point to all REPs. However, to speed up labeling, a more advanced method such as a k-d tree is used instead of this exhaustive search.
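The nearest/second-nearest query of step 3) can be sketched with an exhaustive scan, which costs O(|V|) per point; in practice a k-d tree (for example `scipy.spatial.cKDTree` queried with `k=2`) answers the same query faster once many points must be labeled. Names here are illustrative:

```python
import numpy as np

def two_nearest_reps(reps, x):
    """Return ((i1, d1), (i2, d2)): the index and distance of the
    nearest and second-nearest REP to x, by exhaustive scan."""
    d = np.linalg.norm(np.asarray(reps, float) - np.asarray(x, float), axis=1)
    i1, i2 = np.argsort(d)[:2]        # two smallest distances
    return (int(i1), float(d[i1])), (int(i2), float(d[i2]))
```

Building a k-d tree once over the REPs and querying it per point replaces this scan without changing the result.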

Also, the ambiguity ratio threshold ε determines when the fast rule of step 3)-4 applies; data points lying in the ambiguous region between two adjacent Voronoi cells invoke the slower method of step 3)-5. In this invention, in order to process the clustering more rapidly, ε is set to a value close to 1. This is because, for most of the data used in the experiments, the nearest and second-nearest REPs belong to the same cluster and the ambiguity ratio is much smaller than 1. If a data set contains a significant percentage of data points with ambiguity ratios close to 1, ε can instead be set lower, or to a certain percentage, adding the slower step for those points.

The concrete equations and algorithms of Voronoi cell-based clustering according to an embodiment of the present invention have been described above.

FIGS. 11 to 15 show experimental results verifying the effect of the present invention.

The figures show the results of experiments performed on several well-known benchmark data sets to demonstrate the clustering performance of the present invention; its performance is compared with that of conventional techniques on the same data.

FIGS. 11 and 12 show the labeling time for each test data set. The experimental results confirm the improvement in labeling time achieved by the present invention.

FIGS. 13 to 15 concern images to which conventional methods have been applied for image segmentation. FIG. 14 shows experimental results compared with other clustering methods, and FIG. 15 shows the image information and segmentation time. The experimental results show that the segmentation produced by the present invention divides the images into regions of similar colors and is much faster than the other methods.

As described above, it is experimentally confirmed that the clustering method of the present invention effectively reduces the labeling time while maintaining accuracy on large-scale problems. In particular, it applies well to actual large-scale problems such as image segmentation.

The clustering apparatus 10 according to an exemplary embodiment of the present invention can be applied in various industries that need to cluster and analyze large amounts of data. It can be applied to analyzing large-volume data such as customer data in the marketing analysis of distributors and electronics companies, and to identifying objects in images or videos by image segmentation.

(Example 1) The new clustering method can be applied to image segmentation. Image segmentation identifies objects by recognizing the boundaries between objects in an image and can be used, for example, for image compression. Image data, however, are voluminous; particularly for high-resolution images or video, it is difficult to process them in real time with conventional methods. According to the present invention, image segmentation can be performed in real time even on large-scale image data, which can be applied in research fields such as computer vision.

(Example 2) A financial institution such as a bank uses customer data to decide whether to lend and to analyze customer groups. With conventional methods, whenever new customers are added, all data must be clustered again, and as the data grow, the training time and the cluster-allocation time for new data increase. With the present invention, clusters can be allocated to new data inductively, without repeating the training process, so customer analysis can be performed in real time even on large data sets. In addition, because the present invention can control the number of customer groups, the entire customer base can be divided into a desired number of groups, and each group can be characterized by identifying the traits of the customers belonging to it.

Once again, the present invention provides a Voronoi cell-based clustering technique that speeds up support-based clustering without loss of clustering performance. In contrast to previous studies that optimized only one of the two steps of training and labeling, the present invention optimizes both processes in terms of time and space complexity.

Estimating the kernel support and the labeling process involve a very large number of kernel computations in every support-function call. In terms of computational complexity, SEV labeling itself completes within a reasonable time for a given training sample size, because the number of SEVs is generally very small compared with the total number of data. However, finding the SEV corresponding to each of the N training data requires considerable computation time when N is very large: finding the SEV for one data point amounts to finding a minimum of the support function, and the total cost of finding the SEVs of all N data grows in proportion to N times this per-point cost.

This is why equilibrium-based clustering is difficult to apply to large-volume data problems such as image segmentation. Accordingly, it is an object of the present invention to provide a method applicable to large-scale problems by improving the speed of the labeling step of support-function-based clustering, while maintaining the useful properties described above and without impairing accuracy.

To overcome the computational burden of the two steps, the present invention introduces two main approximation techniques. The first introduces sampling, based on general error bounds, when estimating the support. The second, instead of searching for the REP to which the dynamic system converges for each data point, simply uses a similarity inference that finds the nearest REP from each data point as a first, fast labeling step.

With the first technique, the number of samples sufficient to capture the support of the data distribution is calculated, and the support function is constructed using only these data, improving the cost of constructing the support function. Beyond demonstrating that such samples can represent the support well, this not only allows the sampled data to be used but also adds theoretical guarantees.

Second, at the labeling stage, the labeling time complexity of existing methods is approximately proportional to the number of support vectors constituting the support function (N_sv) times the number of kernel-function evaluations (m), and the computation cost can grow greatly because the gradient of the function, whose complexity is likewise proportional to the number of SVs, must also be evaluated. To solve this problem, the entire data space is divided so that each data point only needs to find its nearest REP. Using this new labeling method, which requires no function evaluations, the labeling speed is improved dramatically without degrading clustering performance.

It will be understood by those skilled in the art that the foregoing description of the present invention is for illustrative purposes only, and that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. The above-described embodiments are therefore illustrative in all aspects and not restrictive. For example, each component described as a single entity may be implemented in distributed form, and components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.

10: Clustering device
100: cluster definition unit
200: cluster allocation unit
300: Data Store

Claims (10)

In a clustering apparatus,
a cluster definition unit that extracts sample data from a data set stored in a data store, defines each cluster from the extracted sample data, and calculates and labels representative points of the clusters; and
a cluster allocation unit that calculates, for each data item of the data set, the distance to the representative point of the closest cluster and the distance to the representative point of the next-closest cluster, and allocates each data item to one of the defined clusters based on the ratio of these distances,
wherein each representative point is calculated for use as a seed of a Voronoi cell.
The apparatus according to claim 1,
wherein the cluster allocation unit allocates the data to the cluster of the closest representative point when the ratio is less than or equal to a threshold value.
The apparatus according to claim 1,
wherein the cluster allocation unit calculates a target representative point using a dynamical system when the ratio is equal to or larger than the threshold, and allocates the data to the cluster of the calculated representative point.
The apparatus according to claim 1,
wherein the cluster definition unit calculates a support function from the sample data and calculates the representative points using the support function.
The apparatus according to claim 1,
wherein the cluster definition unit constructs a weighted graph using the representative points, deletes edges having weights less than a threshold value from the weighted graph to divide it into partial graphs, and defines each cluster such that each divided partial graph is included in a respective cluster.
A clustering method using a clustering apparatus, comprising:
a cluster definition step of extracting sample data from a data set stored in a data store, defining each cluster from the extracted sample data, and calculating and labeling representative points of the clusters; and
a cluster allocation step of calculating, for each data item of the data set, the distance to the representative point of the closest cluster and the distance to the representative point of the next-closest cluster, and allocating each data item of the data set to one of the defined clusters based on the ratio of these distances,
wherein each representative point is calculated for use as a seed of a Voronoi cell.
The method according to claim 6,
wherein the cluster allocation step allocates the data to the cluster of the closest representative point when the ratio is less than or equal to a threshold value.
The method according to claim 6,
wherein the cluster allocation step calculates a target representative point using a dynamical system when the ratio is equal to or larger than the threshold, and allocates the data to the cluster of the calculated representative point.
The method according to claim 6,
wherein the cluster definition step calculates a support function from the sample data and calculates the representative points using the support function.
The method according to claim 6,
wherein the cluster definition step constructs a weighted graph using the representative points, deletes edges having weights less than a threshold value from the weighted graph to divide it into partial graphs, and defines each cluster such that each divided partial graph is included in a respective cluster.
KR1020140031027A 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering KR101577249B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020140031027A KR101577249B1 (en) 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020140031027A KR101577249B1 (en) 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering

Publications (2)

Publication Number Publication Date
KR20150108173A KR20150108173A (en) 2015-09-25
KR101577249B1 true KR101577249B1 (en) 2015-12-14

Family

ID=54246287

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020140031027A KR101577249B1 (en) 2014-03-17 2014-03-17 Device and method for voronoi cell-based support clustering

Country Status (1)

Country Link
KR (1) KR101577249B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220060375A (en) * 2020-11-04 2022-05-11 서울대학교산학협력단 Method and apparatus for performing fair clustering through estimating fair distribution

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152277B (en) * 2023-03-10 2023-09-22 麦岩智能科技(北京)有限公司 Map segmentation method and device, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ho-Sook Kim and Hwan-Seung Yong, "Design of a weighted graph-based spatial clustering algorithm," Department of Computer Science, Ewha Womans University, 2002.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220060375A (en) * 2020-11-04 2022-05-11 서울대학교산학협력단 Method and apparatus for performing fair clustering through estimating fair distribution
KR102542451B1 (en) * 2020-11-04 2023-06-12 서울대학교산학협력단 Method and apparatus for performing fair clustering through estimating fair distribution

Also Published As

Publication number Publication date
KR20150108173A (en) 2015-09-25

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Fischer et al. Bagging for path-based clustering
US8954365B2 (en) Density estimation and/or manifold learning
CN113221065A (en) Data density estimation and regression method, corresponding device, electronic device, and medium
Kim et al. Improving discrimination ability of convolutional neural networks by hybrid learning
Richards et al. Clustering and unsupervised classification
CN108564083A (en) A kind of method for detecting change of remote sensing image and device
Reddy et al. A Comparative Survey on K-Means and Hierarchical Clustering in E-Commerce Systems
KR101577249B1 (en) Device and method for voronoi cell-based support clustering
Xu et al. MSGCNN: Multi-scale graph convolutional neural network for point cloud segmentation
Zeybek Inlier point preservation in outlier points removed from the ALS point cloud
Liu et al. A new local density and relative distance based spectrum clustering
Ghoshal et al. Estimating uncertainty in deep learning for reporting confidence: An application on cell type prediction in testes based on proteomics
Xu et al. The image segmentation algorithm of colorimetric sensor array based on fuzzy C-means clustering
Zheliznyak et al. Analysis of clustering algorithms
KR100895261B1 (en) Inductive and Hierarchical clustering method using Equilibrium-based support vector
Yu et al. Sparse reconstruction with spatial structures to automatically determine neighbors
Shi et al. Fuzzy support tensor product adaptive image classification for the internet of things
Richards et al. Clustering and Unsupervised Classification
Liu et al. A novel local density hierarchical clustering algorithm based on reverse nearest neighbors
CN112800138A (en) Big data classification method and system
CN113779287A (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
Rao et al. Common object discovery as local search for maximum weight cliques in a global object similarity graph
KR101133804B1 (en) Fast kernel quantile clustering method for large-scale data
Chang et al. Fast marching based superpixels generation

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20181203

Year of fee payment: 4

FPAY Annual fee payment

Payment date: 20191203

Year of fee payment: 5