KR101577249B1 - Device and method for voronoi cell-based support clustering - Google Patents
- Publication number
- KR101577249B1, KR1020140031027A
- Authority
- KR
- South Korea
- Prior art keywords
- data
- cluster
- clustering
- representative point
- representative
- Prior art date
Landscapes
- Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
Abstract
A clustering apparatus comprises a clustering definition unit that extracts sample data from a data set stored in a data storage and defines each cluster from the extracted sample data, and a clustering allocation unit that allocates each item of the data set to one of the defined clusters. The clustering definition unit calculates representative points usable as seeds of Voronoi cells and defines the clusters by labeling the representative points.
Description
The present invention relates to a clustering apparatus and method.
In clustering, similar or related items are grouped into a plurality of groups (hereinafter referred to as clusters), as shown in the example of FIG. 5. In other words, clustering classifies objects with similar characteristics into the same cluster, so that the data within a cluster resemble one another while the clusters themselves differ from one another.
Cluster analysis is used to characterize data and is applied in a wide variety of fields such as machine learning, image analysis, information retrieval, pattern recognition, and data analysis. Accordingly, various clustering methods have been developed.
Among them, kernel support clustering, that is, support-based clustering using a kernel, can create complex cluster boundaries compared with other clustering methods and handles outlier data well, so it has been actively studied. The inventor of the present invention has also filed a patent for a basin-cell-based clustering technique using a dynamical system (Application No. 10-2007-00844468).
However, such conventional support-based clustering has the problem that the computational load of the clustering operation is too high, especially when processing large amounts of data. Therefore, a clustering apparatus and method are needed that can define complex clusters with high accuracy and low computational load.
Korean Patent No. 10-1024038 ("Cluster configuration method of cluster sensor network and sensor network to which the method is applied") discloses a configuration for clustering sensor nodes of a sensor network in connection with the present invention.
In addition, US Patent Application Publication No. US-A-2010-0036647 (" Efficient computation of Voronoi diagrams of general generators in general spaces and uses thereof ") discloses a configuration for generating a Voronoi diagram.
SUMMARY OF THE INVENTION It is an object of the present invention to provide a clustering apparatus and method with low computational load and high accuracy.
According to an aspect of the present invention, there is provided a clustering apparatus including: a clustering definition unit that extracts sample data from a data set stored in a data storage and defines clusters from the extracted sample data; and a clustering allocation unit that allocates each item of the data set to one of the defined clusters, wherein the clustering definition unit calculates representative points usable as seeds of Voronoi cells and defines the clusters by labeling the representative points.
According to a second aspect of the present invention, there is provided a clustering method using a clustering apparatus, comprising: a clustering definition step of extracting sample data from a data set stored in a data storage and defining clusters from the extracted sample data; and a clustering allocation step of allocating each item of the data set to one of the defined clusters, wherein the clustering definition step calculates representative points that can be used as seeds of Voronoi cells and defines each of the clusters by labeling the representative points.
The present invention provides a clustering apparatus and method in which the computational load is low and the accuracy is high.
The computational load can be greatly reduced by using Voronoi cells instead of basin cells, whose computation is highly complex.
In addition, by falling back to basin-cell-based clustering only for the data that require it, the invention can still define complex cluster shapes and maintain the advantage of accurate clustering.
FIG. 1 shows the structure of a clustering apparatus according to an embodiment of the present invention.
FIG. 2 shows a flow of a clustering method according to an embodiment of the present invention.
FIG. 3 illustrates the flow of the clustering definition step according to an embodiment of the present invention.
FIG. 4 illustrates a flow of the clustering allocation step according to an embodiment of the present invention.
Figure 5 shows an embodiment of clustering.
FIG. 6 shows a comparison of basin cells and Voronoi cells.
Fig. 7 shows the concept of a basin cell.
Figure 8 shows the concept of a Voronoi cell.
Figure 9 shows a partial graph.
FIG. 10 conceptually shows an embodiment using a Voronoi cell instead of a Basin cell.
FIGS. 11 to 15 show experimental results for verifying the effect of the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings, so that those skilled in the art can readily practice the invention. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. For clarity, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.
Throughout the specification, when a part is referred to as being "connected" to another part, it includes not only being "directly connected" but also being "electrically connected" with another part in between. Also, when an element is referred to as "comprising" another element, this means that it can include other elements as well, rather than excluding them, unless specifically stated otherwise.
Hereinafter, the structure of the clustering apparatus according to an embodiment of the present invention and the flow of the clustering method according to an embodiment of the present invention will be briefly described with reference to the structural view of FIG. 1 and the flow charts of FIGS. 2 to 4. Then, with reference to FIGS. 5 through 10, the concept of basin-cell-based clustering according to an embodiment of the present invention will be described in detail, along with specific mathematical expressions and algorithms.
FIG. 1 shows the structure of a clustering apparatus according to an embodiment of the present invention.
The clustering apparatus according to an embodiment of the present invention includes a clustering definition unit and a clustering allocation unit. The two steps of the clustering method described below are performed by the clustering definition unit and the clustering allocation unit, respectively.
The data to be clustered and the clustering result may be stored in the data storage.
In this case, by defining a cluster using a basin cell and labeling the data, the data can be accurately classified into clusters of complex shapes, but the computational load is high. Therefore, the present invention reduces this load by approximating the basin cells with Voronoi cells.
FIG. 2 shows a flow of a clustering method according to an embodiment of the present invention.
First, clusters are defined by dividing the sample space (S100): sample data is extracted from the data set stored in the data storage, and each cluster is defined from the extracted sample data.
Next, data is allocated to each cluster (S200). That is, each item of the data stored in the data storage is assigned to one of the clusters defined in step S100.
FIGS. 3 to 4 illustrate the flow of each step of the clustering method according to an embodiment of the present invention. As described above, the flow of each step will be briefly described and the details will be described later.
FIG. 3 illustrates the flow of the clustering definition step according to an embodiment of the present invention.
In the cluster definition (S100), first, a data sample is extracted (S110).
Next, the support function and the optimum value are calculated (S120). The support function is conceptually defined as a contour line when the data is expressed in the feature space, as will be described later.
Next, a dynamic system is constructed using the support function and a representative point is calculated (S130). The representative point is used as a seed point of the Voronoi cell, which will be described later.
Next, a weighted graph is constructed using representative points (S140).
Next, the weighted graph is converted into subgraphs (S150): edges whose weights do not satisfy the threshold criterion are cut, dividing the graph into a plurality of subgraphs.
Next, the cluster definition is completed by labeling each representative point using the calculated partial graph (S160).
FIG. 4 illustrates a flow of the clustering allocation step according to an embodiment of the present invention.
The cluster allocation step (S200) repeats the following steps for each data item until labeling is completed (S260).
First, the representative point nearest to the data and the second-nearest representative point are calculated (S220).
Next, it is determined whether the ratio of the distances from the data to the two representative points is below a certain level, or whether the two representative points belong to the same cluster (S230). This step decides whether labeling is to be performed using the Voronoi cell or using the dynamic system employed in the basin-cell-based clustering technique, as described above.
When the Voronoi cell can be used, the data is allocated to the cluster of the closest representative point (S240).
If the Voronoi cell cannot be used, the representative point to which the data converges is calculated using the dynamic system, and the data is assigned to that representative point's cluster (S250).
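The decision in steps S220 through S250 can be sketched in Python. This is a minimal illustration, not the patented implementation; the function and parameter names (`assign_cluster`, `theta`) are hypothetical, and the dynamic-system fallback of step S250 is only signalled, not performed.

```python
import numpy as np

def assign_cluster(x, reps, labels, theta=0.95):
    """Assign a cluster label to point x via the Voronoi cells of the
    representative points (REPs), falling back to the slower
    dynamic-system labeling only in ambiguous regions.

    reps   : (m, d) array of representative points (Voronoi seeds)
    labels : length-m array, cluster label of each REP
    theta  : ambiguity-ratio threshold (hypothetical name)
    Returns (label, needs_dynamic_system).
    """
    d = np.linalg.norm(reps - x, axis=1)
    order = np.argsort(d)
    r1, r2 = order[0], order[1]       # nearest and second-nearest REP
    ratio = d[r1] / d[r2]             # close to 1 => near a cell boundary
    if ratio < theta or labels[r1] == labels[r2]:
        return labels[r1], False      # Voronoi cell is trustworthy here (S240)
    return labels[r1], True           # defer to basin-cell labeling (S250)
```

When the flag is `True`, the caller would run the dynamic process of the earlier basin-cell method instead of trusting the nearest-REP label.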
Referring to FIGS. 5 through 10, the concepts underlying basin-cell-based and Voronoi cell-based clustering according to an embodiment of the present invention will be described in detail.
Figure 5 shows an embodiment of clustering.
(a) is a diagram showing the data stored in the data storage, and (b) shows the result of clustering that data, with the space divided into clusters.
Therefore, clustering can be seen as dividing the space into multiple clusters. The inventor of the present invention has already filed a patent application that defines the boundary of a cluster using the concept of the basin cell (Application No. 10-2007-00844468). However, since that approach has a high computational load as described above, the present invention simplifies the boundaries using Voronoi cells.
Fig. 6 shows a comparison between basin cells and Voronoi cells, Fig. 7 shows the concept of a basin cell, and Fig. 8 shows the concept of a Voronoi cell.
Referring to FIG. 7, the concept of a basin cell will be described. A point at which the gradient of a function becomes zero and which is stable is referred to as a stable equilibrium vector (SEV) or stable equilibrium point (SEP; hereinafter, equilibrium point). As can be seen in the figure, a basin cell is the region whose points converge to a given equilibrium point under the dynamic process. Therefore, each equilibrium point can represent its basin cell.
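The convergence behaviour that defines a basin cell can be illustrated with a toy one-dimensional example. The double-well function below merely stands in for a support function (an assumption for illustration only); it has two equilibrium points, at x = -1 and x = +1, and every starting point converges to one of them.

```python
import numpy as np

# Toy double-well function standing in for a support function:
# two stable equilibrium points (SEPs) at x = -1 and x = +1.
f      = lambda x: (x**2 - 1.0)**2
grad_f = lambda x: 4.0 * x * (x**2 - 1.0)

def descend(x0, lr=0.05, steps=2000):
    """Follow the dynamic process dx/dt = -grad f(x) by explicit Euler
    steps; the point it converges to identifies the basin cell of x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x
```

Every start point with x > 0 lies in the basin cell of the SEP at +1, and every start point with x < 0 in the basin cell of the SEP at -1; the equilibrium point thus represents its basin cell.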
Referring to FIG. 8, Voronoi tessellation is a method of dividing a space using a set of points called seeds: the region of space closer to a given seed than to any other seed is called a Voronoi cell, and the diagram obtained by dividing the space into Voronoi cells as shown in FIG. 8 is called a Voronoi diagram. The points included in a Voronoi cell are closer to the seed of that cell than to the seeds of the other cells, which makes the Voronoi cell well suited to clustering.
In this specification, the seed of each Voronoi cell is defined as a representative equilibrium point (REP); this point is also an equilibrium point in the sense of FIG. 7. That is, the equilibrium point of each basin cell is reused as the seed of the corresponding Voronoi cell.
FIG. 6 shows basin cells (curves) and Voronoi cells (straight lines) for well-known two-dimensional artificial data sets: (a) 2D-N200, (b) Four-Gaussian, (c) Two-circles, and (d) Three-circles. As shown, clusters can be defined such that the boundary of the Voronoi cell is very close to the boundary of the basin cell.
However, the figure also shows that Voronoi cells and basin cells do not coincide exactly, because the bridging point of two adjacent representative points may not be located exactly on the basin cell boundary. Therefore, as described above, the clustering allocation step S200 according to an exemplary embodiment of the present invention includes a step S230 of determining whether data labeling can be performed using a Voronoi cell. The detailed concept is described with reference to FIG. 10.
Figure 9 shows a partial graph.
As shown in the figure, a weighted graph is constructed by assigning weights to the edges between representative points; the graph is then divided into a plurality of subgraphs by cutting edges, such as S3, that do not satisfy the weight criterion. Each subgraph, that is, the union of the Voronoi cells of its representative points, constitutes one cluster. The concepts of the weighted graph and the subgraph are discussed in the inventor's earlier application on basin-cell-based clustering mentioned above, so a detailed description is omitted here.
FIG. 10 conceptually shows an embodiment using a Voronoi cell instead of a Basin cell.
The figure shows a Voronoi cell corresponding to one basin cell (sharing the same representative point R) in order to describe the contents of FIG. 6 more concretely. Since the basin cell C1 is bounded by a curve as shown in the drawing, data can be accurately labeled even if the shape of the cluster is complicated. On the other hand, since the Voronoi cell C2 simplifies the boundary to a straight line, both the operation of defining a cluster and the operation of assigning data to the cluster are simplified.
Referring to (c), since the point P1 exists in both the Voronoi cell and the basin cell, it is labeled as belonging to the cluster even if the Voronoi cell is used instead of the basin cell. However, since the point P2 exists outside the Voronoi cell but inside the basin cell, it would not be labeled as belonging to the cluster if the Voronoi cell were used instead of the basin cell. Therefore, for such points, the method used with basin cells should be applied.
Therefore, after calculating the representative point R closest to the data point P2 and the second-closest representative point (S220), it is determined whether the ratio of the distances from the data to the two representative points exceeds a certain level (S230).
The basic concepts underlying the Voronoi cell-based clustering according to an embodiment of the present invention have been described above. Now, its specific formulas and algorithms will be described in detail.
As described above, the kernel support clustering method consists of two steps. The first step creates a support function indicating the boundaries of the clusters, and the second step is a labeling step that assigns each data item to a cluster through the support function.
1) Support function
The support function $f: \mathbb{R}^n \to \mathbb{R}_+$ maps n-dimensional data to positive real numbers and measures the support of the data distribution. Its level set is divided into m connected subsets as follows:

(EQ-1-1)  $\{x : f(x) \le r\} = C_1 \cup C_2 \cup \cdots \cup C_m$

Here each $C_i$ is a connected subset, and m is determined by the number of clusters. As described above, if Fig. 5-(a) is the original data, Fig. 5-(b) shows the result divided into m clusters. The support function can be generated by the SVDD (Support Vector Domain Description) method. The SVDD method maps data points to a high-dimensional feature space and finds the minimum-radius sphere containing most of the data in this feature space. When the sphere thus found is mapped back to the data space, it splits into m closed sets representing each cluster.
The kernel support function trained by the SVDD method is obtained as:

(EQ-1-2)  $f(x) = 1 - 2\sum_i \beta_i K(x_i, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)$

In the above equation, $x_i$ and $\beta_i$ are the support vectors and their coefficients, and $K$ is the kernel function. The method used in the present invention is not limited to SVDD but can be applied to any method that obtains a support function empirically from data.

2) After obtaining the support function, there is a labeling step. There are several methods for support clustering labeling. Proximity-graph-based methods include the Delaunay Diagram (DD), Minimum Spanning Tree (MST), and K-nearest neighbor (KNN) methods. However, the MST and KNN methods are likely to miss important boundary detail, so the resulting clusters may differ from the actual boundaries. To solve this problem, an equilibrium vector-based clustering (EVC) method using the topological features of the trained support function has been proposed.
The EVC consists of two steps. The first step is to construct from the support function $f$ the associated dynamic system:

(EQ-1-3)  $\dot{x} = -E(x)\,\nabla f(x)$
Here $E(x)$ is a positive definite symmetric matrix for all data points $x$. A unique solution of (EQ-1-3) exists for every time t and starting point x when $f$ is twice differentiable and the norm of its gradient is bounded. A state vector $s$ with $\nabla f(s) = 0$ is an equilibrium vector of the system (EQ-1-3). If the Jacobian of the system at $s$ has no zero eigenvalue, $s$ is called hyperbolic. A hyperbolic equilibrium vector is (i) a stable equilibrium vector (SEV) if all eigenvalues of the Hessian are positive, and (ii) an unstable equilibrium vector (UEV) otherwise. An SEV is a local minimum of the support function $f$. An important concept in EVC for inductive learning is the basin cell associated with (EQ-1-3). The basin of an SEV $s$ is the set of all points that converge to $s$ as the dynamic process proceeds:

$A(s) = \{x : \lim_{t\to\infty} x(t) = s\}$

and the basin cell $\bar{A}(s)$ of the SEV $s$ is defined as the closure of $A(s)$. The boundary of the basin cell is called the basin cell boundary. One of the good features of the basin cell is that, under certain conditions, the entire data space is partitioned into the basin cells of several SEVs:

(EQ-1-4)  $\mathbb{R}^n = \bigcup_{s \in S} \bar{A}(s)$
Here $S$ is the set of SEVs of the dynamic system (EQ-1-3), so the entire data space can be divided into basin cells. Since every data point converges to a specific SEV under the dynamic process, the basin cells can be identified by finding the SEVs. In the second step, labels are assigned using the adjacency matrix of the SEVs, or the entire data space is labeled using transition equilibrium vectors (TEV). This method expands the clusters of equation (EQ-1-1) as follows:

(EQ-1-5)  $\tilde{C}_i = \bigcup_{s \in C_i \cap S} \bar{A}(s)$

Here $\tilde{C}_i$ is the expanded cluster corresponding to $C_i$. As a result, the entire data space can be divided into the expanded clusters:

(EQ-1-6)  $\mathbb{R}^n = \tilde{C}_1 \cup \tilde{C}_2 \cup \cdots \cup \tilde{C}_m$
Thus, bounded support vectors located outside the cluster boundaries, as well as data not used for training, can be given inductive cluster labels.
The SMO algorithm, widely used for solving the quadratic program arising in support-based clustering, has a time complexity that grows rapidly with the number N of data points, and the time complexity of cluster labeling grows with the number of support vectors constituting the support function and with the number of kernel function evaluations required per call. To speed up cluster labeling and reduce the cost of estimating the support, two main approximation methods are used for fast computation: 1) first, the support estimate of the entire data is approximated by the support estimate obtained from a small data sample; 2) second, the cluster boundary partitioned using basin cells is approximated by Voronoi cells. The Voronoi cell assigns each data item to its nearest representative point without an explicit call to the kernel support function. The following sections describe the details of this method.
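Approximation 1) above amounts to training on a small sample of the data. A minimal sketch, assuming uniform random sampling (the text leaves the exact sampling scheme open) and a hypothetical helper name:

```python
import numpy as np

def sample_for_training(X, rate=0.05, seed=0):
    """Approximation 1 (sketch): estimate the support from a small random
    sample of the full data set; the rest is labelled inductively later.
    Returns (training sample, remaining data)."""
    rng = np.random.default_rng(seed)
    n = max(1, int(len(X) * rate))
    idx = rng.choice(len(X), size=n, replace=False)
    mask = np.zeros(len(X), dtype=bool)
    mask[idx] = True
    return X[mask], X[~mask]
```

The support level function of the next section would then be trained on the first return value only.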
<Sampling and support level function construction>
The support level function is a positive scalar function $f: \mathbb{R}^n \to \mathbb{R}_+$, and its level set $L_r = \{x : f(x) \le r\}$ estimates the support domain of the data distribution (the domain surrounding most of the data points). It is divided into m connected subsets as follows:

(EQ-2-1)  $L_r = C_1 \cup C_2 \cup \cdots \cup C_m$
In general, the support level function is obtained with the SVDD (Support Vector Domain Description) method, Gaussian process clustering, or another kernel distribution function estimator. In the present invention, SVDD is used to estimate the support function; nevertheless, the method can easily be applied to any support function. Given the data set, the SVDD method maps the data points to a high-dimensional feature space and finds the minimum-radius sphere containing most of the data in that space. Mapped back to the data space, the sphere splits into m closed sets representing the respective clusters. With the Gaussian kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$, the kernel support function trained by the SVDD method is obtained:

(EQ-2-2)  $f(x) = 1 - 2\sum_i \beta_i K(x_i, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)$
In the above equation, $x_i$ and $\beta_i$ are the support vectors and their coefficients. In many real-world large data sets, most of the data is concentrated in a fairly narrow area, so a very small portion of the data can explain the distribution of the entire data, particularly its support, quite well. We now justify the use of sampled data points and explain how the trained Gaussian kernel support function represents the support of the class-conditional distribution, applying a generalization error bound theorem.
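The trained support function of (EQ-2-2) can be evaluated directly once the support vectors and coefficients are known. The sketch below assumes a Gaussian kernel and hand-picked (untrained) support vectors and coefficients purely for illustration; a real SVDD solver would supply these values.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), the kernel assumed in EQ-2-2
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

def support_function(x, svs, betas, gamma=1.0):
    """Evaluate the trained kernel support level function (EQ-2-2 sketch):
    the squared feature-space distance from Phi(x) to the SVDD sphere
    centre. svs: (n, d) support vectors; betas: their coefficients."""
    const = betas @ gaussian_kernel(svs[:, None, :], svs[None, :, :], gamma) @ betas
    return 1.0 - 2.0 * (betas @ gaussian_kernel(svs, x, gamma)) + const
```

Points inside the estimated support yield small function values; points far from all support vectors yield large ones, which is exactly the property the level set $L_r$ exploits.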
Result 1: Let N samples be drawn iid (independently and identically distributed) from a probability distribution P that contains no discrete components. For the trained Gaussian kernel support function given by (EQ-2-2) and the level set derived from the level value r, a probabilistic bound (EQ-2-3) holds with high probability over the samples on the mass of the distribution lying outside the estimated support, together with a corresponding condition (EQ-2-4) on the function values of the support vectors.

This result indicates that when the sample is large enough that the right-hand side of the bound becomes sufficiently small, which is the case for the large data sets targeted by the present invention, sample data of moderate size can sufficiently capture the support of the data distribution.

<Sample space division>
After a support level function is created from a small percentage of the full data set, the next step is to assign a cluster label to the sampled data and the remaining data (or unknown test data). This process can be done easily with inductive basin-cell-based labeling methods, because they divide the entire data space into basin cells while naturally extending the clusters so defined.
One important drawback of this labeling process, however, is that it can be very costly: it uses the gradient of a function whose evaluation cost grows with the number of support vectors of the trained support level function. To overcome this drawback and speed up the labeling process, the present invention uses a new labeling method that does not require evaluating the function for every point.
A state vector at which the support function attains a local minimum is defined as a representative equilibrium point (REP); this point is also a stable equilibrium point (SEP). The core idea of the present invention is to determine the cluster of each data item by assigning it the cluster label of its nearest REP. Define the Voronoi cell of an REP s as the closure of the set of all points nearest to s:

(EQ-2-5)  $V(s) = \overline{\{x : \|x - s\| \le \|x - s'\| \text{ for all REPs } s'\}}$
All of the data in V(s) is assigned to the same cluster as s. The method then separates the entire data space into the Voronoi cells of the REPs, analogously to the way the existing basin-cell-based method separates it into basin cells:

(EQ-2-6)  $\mathbb{R}^n = \bigcup_{s \in S} V(s)$
Two REPs are defined to be adjacent when their Voronoi cells meet. The segment weight between two adjacent REPs $s_i$ and $s_j$ is defined as:

(EQ-2-7)  $w_{ij} = \min \{ f(x) : x \in \partial V(s_i) \cap \partial V(s_j) \}$
That is, $w_{ij}$ is the smallest function value attained on the overlapping portion of the boundaries of the Voronoi cells of $s_i$ and $s_j$. The point attaining this smallest value is called the bridging point. (The bridging point differs from the transition equilibrium vector (TEV) used in the conventional method.) With these terms, a weighted graph $G = (S, E)$ can be defined: its vertices are all the REPs, and every pair of adjacent REPs is connected by an edge weighted by the segment weight of (EQ-2-7). The distance between two REPs is then defined over paths through adjacent REP pairs:

(EQ-2-8)  $d(s_i, s_j) = \min_{P : s_i \to s_j} \; \max_{(k,l) \in P} w_{kl}$
On the other hand, if no path of adjacent REP pairs connects the two REPs, the distance is defined as infinite. The obtained definition of the distance between REPs is similar to the definition of path distance in the prior art, but with a difference: the prior art uses the TEV point with the minimum value along the path, whereas the above definition uses the bridging point with the minimum value on the boundary of the Voronoi cell.
Is the smallest value among the points having the maximum function value among the points on the path connecting the two REPs through the connection point to the other REP by escaping one REP. graph , The number K of clusters can be adjusted by assigning one cluster for each REP and merging the clusters of two REPs hierarchically into one cluster using the same procedure as in the agglomerative hierarchical clustering. this And Voronoi cells are able to quickly process the labeling process for data, And it saves a lot of storage space because it uses only G information such as REP and segment weight.now
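The hierarchical merging over the weighted graph G can be sketched with a small union-find structure, merging edges in increasing weight order (single-linkage style) until K clusters remain. The function name and the `(i, j, weight)` edge format are hypothetical:

```python
def rep_clusters(n_reps, edges, k):
    """Single-linkage merging of REPs (sketch): each REP starts as its
    own cluster; edges (i, j, weight) are merged in increasing weight
    order until exactly k clusters remain, mirroring agglomerative
    hierarchical clustering. Returns one cluster label per REP."""
    parent = list(range(n_reps))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    clusters = n_reps
    for i, j, _ in sorted(edges, key=lambda e: e[2]):
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
    roots = sorted({find(i) for i in range(n_reps)})
    return [roots.index(find(i)) for i in range(n_reps)]
```

Because low segment weights indicate REPs bridged inside the same level-set component, merging in increasing weight order keeps strongly connected REPs in one cluster while K controls how far the merging proceeds.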
Partial graph of To , sign Line segments with REP pairs . Basen cell of REP In general ≪ / RTI > .(EQ-2-9)
In general, Voronoi cells do not satisfy these attributes. However, in many cases,
A set of levels Lt; / RTI > That is, Wow end In the same connected comp. If Wow A set of levels Belongs to the same community, and the opposite holds.As described above, FIG. 6 shows a comparison of Voronoi and Basen cells for a well-known two-dimensional artificial data set, 2D-N200, Four-Gaussian, Two-circles and Three-circles. The figure shows that the Voronoi cell and the Basen cell do not match exactly because the connection points of two adjacent REPs may not be located exactly at the baseline cell boundary.
Nevertheless, the boundaries generated by the Voronoi cells are similar to those generated by the basin cells, and the clustering performance is maintained. The reason is that in regions where the data near a representative point are concentrated, the data share the same label; and outside the union of the Voronoi cells constituting a specific cluster, the function value is larger than r, as with basin cells. The algorithm of the present invention is now described for actual implementation. It can be divided into three stages: 1) construct a support level function from a sampled data set; 2) find the REPs of the support level function and assign cluster labels to them; 3) label the data.
<Algorithm: Voronoi cell-based kernel support clustering>
<1) Support level function construction>
1: From the given data set, select a sample set at a specific rate so as to maximize its representativeness of the training data; the remaining points constitute the remaining set.
2: Train the support level function from the sample set and compute the optimum level value.
<2) Find the REPs of the support level function and assign a cluster label to each REP>
1: Using the given support level function, compute the set of REPs. This is done in one of two ways, of which the second is mainly used for fast searching:
- apply any available, efficient minimization technique with each data point as the initial point; or
- divide the data points into several sets and apply the dynamic system (EQ-1-3) only to each set's center point.
2: For the REPs, use (EQ-2-7) or (EQ-2-8) to generate the graph whose weighted segments connect pairs of adjacent REPs.
3: Using single-linkage agglomeration, collect only the segments with smaller weights to form the subgraphs, and assign the label of the corresponding cluster to each subgraph.
<3) Labeling>
1: Set the ambiguity ratio threshold θ.
2: for each data point x do
3: Find the nearest REP s1 and the second-nearest REP s2.
4: if ‖x − s1‖ / ‖x − s2‖ ≤ θ, or s1 and s2 have the same cluster label, then assign the cluster label of s1 to x.
5: else
6: Apply the dynamic system (EQ-1-3) to find the REP to which x converges, and assign that REP's cluster label to x.
7: end if
8: end for
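The labeling loop above can be vectorised. The sketch below labels a whole batch by nearest REP and merely flags the ambiguous points that would be routed to the slower dynamic-system pass of step 6; in practice a kd-tree would replace the full distance matrix, as noted later in the text. The function name is hypothetical.

```python
import numpy as np

def label_dataset(X, reps, rep_labels, theta=0.9):
    """Vectorised sketch of the labeling stage: label every point by its
    nearest REP, and flag points whose ambiguity ratio d1/d2 exceeds
    theta while the two nearest REPs carry different labels, for the
    slower dynamic-system pass."""
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    r1, r2 = order[:, 0], order[:, 1]
    rows = np.arange(len(X))
    ratio = d[rows, r1] / d[rows, r2]
    labels = rep_labels[r1]
    ambiguous = (ratio >= theta) & (rep_labels[r1] != rep_labels[r2])
    return labels, ambiguous
```

Only the flagged points pay the cost of the basin-cell procedure, which is the source of the speed-up claimed for the method.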
At this time, the following issues arise in implementation. First, as in step 2)-2, computing the segment weight between two adjacent REPs requires the points on the overlapping boundary of the two adjacent Voronoi cells; such a point x must satisfy:

(EQ-2-10)  $\|x - s_i\| = \|x - s_j\|$
Let $x_0$ be the midpoint between the two REPs and let d be the projection onto the boundary plane S of the descent direction of f at $x_0$; then $x_0 + t\,d$ is a straight line in the plane S with decreasing direction d from the starting point $x_0$. From this, the minimization problem in (EQ-2-7) can be approximated as:

(EQ-2-11)  $\min_t f(x_0 + t\,d)$
This is a one-dimensional optimization problem that can be solved easily with the well-known cubic-interpolation line search technique. After step 2), the entire sample space is divided into Voronoi cells with assigned cluster labels, so determining the label of a test point reduces to determining the Voronoi cell to which it belongs, that is, finding the REP closest to the test point.
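The one-dimensional search of (EQ-2-11) can be sketched with a simple ternary search in place of cubic interpolation (a deliberate simplification, valid when the restriction of f to the line is unimodal on the search interval; the function name and interval bound are assumptions):

```python
import numpy as np

def bridge_value(f, x0, direction, t_max=1.0, iters=100):
    """Approximate the bridging-point search of EQ-2-11 (sketch):
    minimise f along the line x0 + t*d by ternary search, assuming the
    restriction of f to [-t_max, t_max] is unimodal."""
    lo, hi = -t_max, t_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(x0 + m1 * direction) < f(x0 + m2 * direction):
            hi = m2
        else:
            lo = m1
    t = (lo + hi) / 2.0
    return x0 + t * direction, f(x0 + t * direction)
```

The returned function value is the segment weight of (EQ-2-7) for that pair of adjacent REPs, and the returned point approximates the bridging point.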
In step 3), the nearest REP and the second-nearest REP can be found with an exhaustive search that computes the distances from the point to all REPs. However, to speed up labeling, a more advanced method such as a kd-tree is used instead of the exhaustive search.
The ambiguity ratio threshold for confirming the expression Is used, Two adjacent In the ambiguous region between the Voronoi cells, it will invoke the slower method as in step 3) -5. In this invention, in order to more rapidly process the clustering process, As shown in Fig. This is because most of the data used in the experiment belongs to the same first cluster, the second closest REP belongs to the same cluster, Is much smaller than 1. If the dataset contains a significant percentage of data points with ambiguity ratios close to 1, Or set a certain percentage and add a slower step.The concrete equations and algorithms of baseline cell-based clustering according to an embodiment of the present invention have been described above.
FIGS. 11 to 15 show experimental results for verifying the effect of the present invention.
The figures are the results of experiments performed on several well-known benchmark data sets to demonstrate the clustering performance of the present invention. The performance of the present invention is compared to the performance of conventional techniques for the same data.
FIGS. 11 and 12 show the labeling time for each test data set. The experimental results confirm the improvement in labeling time achieved by the present invention.
FIGS. 13 to 15 concern application to image segmentation, using images that have also been used with conventional methods. Experimental results compared with other clustering methods are shown in FIG. 14, and image information and segmentation times are shown in FIG. 15. From the experimental results, it can be seen that the present invention segments the images into regions of similar colors and is much faster than the other methods.
As described above, it is experimentally confirmed that the present clustering method effectively reduces labeling time while maintaining accuracy on large-scale problems. In particular, the clustering method of the present invention applies well to real large-scale problems such as image segmentation.
The present invention can be applied in various fields, for example as follows.
(Example 1) The new clustering method can be applied to image segmentation. Image segmentation identifies objects by recognizing the boundaries between objects in an image and can be used, for example, for image compression. However, image data is large, particularly for high-resolution images or video, and is difficult to process in real time with conventional methods. According to the present invention, image segmentation can be performed in real time even on large image data, which is applicable to research fields such as computer vision.
(Example 2) A financial institution such as a bank uses customer data to decide whether to grant loans or to analyze customer groups. With conventional methods, however, when new customers are added, all data must be clustered again, and as the data grow, the training time and the cluster allocation time for new data increase. With the present invention, clusters can be allocated to new data inductively, without an additional training process, so customer analysis can be performed in real time even on large data. Moreover, by controlling the number of clusters, the entire customer base can be partitioned into a desired number of customer groups, and each group can be characterized by identifying the common traits of the customers belonging to it.
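The inductive behavior described above, assigning a new customer record without retraining, reduces to a nearest-representative lookup. A minimal sketch follows; the class name, field names, and example labels are assumptions for illustration only.

```python
import numpy as np

class InductiveAssigner:
    """Assign new records to existing clusters without any retraining.

    Once the k representative points have been labeled, each new record
    costs only O(k) distance computations, independent of the size of
    the original training set.
    """
    def __init__(self, reps, labels):
        self.reps = np.asarray(reps, dtype=float)   # (k, d) representatives
        self.labels = list(labels)                  # cluster label per rep

    def assign(self, record):
        d = np.linalg.norm(self.reps - np.asarray(record, dtype=float), axis=1)
        return self.labels[int(d.argmin())]
```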
To summarize, the present invention provides a Voronoi cell-based clustering technique that speeds up support-based clustering without loss of clustering performance. In contrast to previous studies that optimized only one of the two steps of training and labeling, the present invention optimizes both steps in terms of time and space complexity.
Both the estimation of the kernel support and the labeling process involve a very large number of kernel evaluations in every support function call. In terms of computational complexity, computing the SEVs themselves takes a reasonable amount of time relative to the training sample size, because the number of SEVs is generally very small compared to the total number of data. However, finding the SEV corresponding to each of the N training data requires considerable computation time when N is very large. Finding the SEV for a single data point amounts to locating a local minimum of the support function, with computational complexity on the order of O(m · N_sv), where m is the number of iterations of the dynamical system and N_sv is the number of support vectors. Therefore, the computational complexity of finding the SEVs of all N data is on the order of O(N · m · N_sv). This is why equilibrium-based clustering is difficult to apply to large-scale data problems such as image segmentation. Accordingly, it is an object of the present invention to provide a method applicable to large-scale problems by improving the speed of the labeling step of support-function-based clustering while maintaining its useful properties and without impairing accuracy.
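To make the per-point cost of the conventional labeling concrete, the sketch below searches for an equilibrium point by gradient descent on a Gaussian-kernel support function and counts kernel evaluations. The coefficients, step size, and iteration count are illustrative assumptions; the patent's actual SEV computation may differ in detail.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2).
    return np.exp(-gamma * np.sum((a - b) ** 2))

def find_sev(x0, svs, betas, gamma=1.0, lr=0.3, m=60):
    """Descend the support function f(x) = 1 - 2 * sum_i beta_i K(sv_i, x).

    Each of the m iterations evaluates the kernel against all N_sv
    support vectors, so one data point costs O(m * N_sv) kernel
    evaluations -- and labeling all N points costs O(N * m * N_sv).
    """
    x = np.array(x0, dtype=float)
    kernel_evals = 0
    for _ in range(m):
        grad = np.zeros_like(x)
        for b, s in zip(betas, svs):
            k = rbf(s, x, gamma)
            kernel_evals += 1
            grad += 4.0 * gamma * b * k * (x - s)   # gradient of f at x
        x -= lr * grad
    return x, kernel_evals
```

With two support vectors and 60 iterations the counter already reports 120 kernel evaluations for a single point, which shows why repeating this for every one of N data items becomes prohibitive.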
To overcome the computational burden of these two steps, the present invention introduces two main approximation techniques. The first introduces sampling, based on generalization error bounds, into the estimation of the support. The second, for fast labeling, does not search for the REP to which each data point converges under the dynamical system; instead, it simply uses a similarity-based inference that finds the REP nearest to each data point.
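The second technique, assigning each point to its nearest REP with a distance-ratio test to decide when the shortcut is safe, can be sketched as follows. The threshold value and the function name are illustrative assumptions.

```python
import numpy as np

def assign_by_voronoi(x, reps, labels, threshold=0.8):
    """Assign x to a cluster using its nearest representative point.

    reps are the Voronoi-cell seeds; labels[i] is the cluster of reps[i].
    Returns (label, needs_fallback). When the ratio of nearest to
    second-nearest distance exceeds the threshold, x lies near a cell
    boundary and the dynamical-system fallback would be used instead.
    """
    d = np.linalg.norm(np.asarray(reps, float) - np.asarray(x, float), axis=1)
    order = np.argsort(d)
    nearest, second = order[0], order[1]
    ratio = d[nearest] / d[second]        # in [0, 1]; small = unambiguous
    if ratio <= threshold:
        return labels[nearest], False     # confident direct assignment
    return labels[nearest], True          # ambiguous: fallback needed
```

The common case (ratio below the threshold) costs only one distance computation per representative point and requires no support-function evaluation at all.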
Through the first technique, the number of samples sufficient to capture the support of the data distribution is calculated, and the support function is computed using only those samples; since constructing the support function is dominated by the kernel matrix over the training data, this reduces its cost (roughly from O(N²) to O(n²) for a sample of size n much smaller than N). Beyond demonstrating empirically that such a sample represents the support well, a theoretical justification for using only the sampled data is also provided.

Second, at the labeling stage, the labeling time complexity of existing methods is approximately O(N · m · N_sv) (N_sv: the number of support vectors constituting the support function; m: the number of iterations, each requiring a kernel evaluation per support vector), because they use the gradient of a function whose cost is proportional to the number of SVs. To solve this problem, the present invention divides the entire data space into Voronoi cells and finds the REP nearest to each data point. Using this new labeling method, which requires no evaluation of the support function for most data, it is confirmed that the labeling time is dramatically improved without deteriorating clustering performance.

It will be understood by those skilled in the art that the foregoing description of the present invention is for illustrative purposes only, and that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. It is therefore to be understood that the above-described embodiments are illustrative in all aspects and not restrictive. For example, each component described as a single entity may be distributed and implemented, and components described as distributed may be implemented in combined form.
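As a concrete illustration of the first approximation technique, estimating the support from a sample rather than the full data set, the sketch below draws a uniform sample and compares the kernel-matrix construction cost. The uniform draw and the quadratic cost model are illustrative assumptions; the patent derives the sample size from a generalization error bound rather than fixing it by hand.

```python
import numpy as np

def sample_for_support(data, n, seed=0):
    # Draw n points uniformly without replacement; in the patent, n is
    # chosen from an error bound so the sample still captures the support.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=n, replace=False)
    return data[idx]

def kernel_matrix_cost(n):
    # Fitting the support function requires the n x n kernel matrix,
    # so the dominant training cost scales as n**2.
    return n * n
```

For example, sampling n = 500 out of N = 10,000 points cuts the kernel-matrix cost by a factor of (N/n)² = 400.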
The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.
10: Clustering device
100: cluster definition unit
200: cluster allocation unit
300: Data Store
Claims (10)
A cluster definition unit that extracts sample data from a data set stored in a data store, defines each cluster from the extracted sample data, and calculates and labels a representative point of each cluster; and
a cluster allocation unit that calculates, for each data item of the data set, the distance to the representative point of the closest cluster and the distance to the representative point of the next closest cluster, and allocates the data item to one of the defined clusters based on the ratio of the two distances,
Wherein the representative point is calculated for use as a seed of a Voronoi cell.
Wherein the cluster allocation unit allocates the data item to the cluster of the closest representative point when the ratio is less than or equal to a threshold value.
Wherein the cluster allocation unit calculates a target representative point using the dynamical system when the ratio is greater than the threshold value, and allocates the data item to the cluster of the calculated representative point.
Wherein the cluster definition unit calculates a support function from the sample data and calculates the representative points using the support function.
Wherein the cluster definition unit constructs a weighted graph using the representative points, deletes edges having weights less than a threshold value from the weighted graph to divide it into subgraphs, and defines the clusters such that the representative points of each divided subgraph form one cluster.
A cluster definition step of extracting sample data from a data set stored in a data store, defining each cluster from the extracted sample data, and calculating and labeling a representative point of each cluster; and
a cluster allocation step of calculating, for each data item of the data set, the distance to the representative point of the closest cluster and the distance to the representative point of the next closest cluster, and allocating the data item to one of the defined clusters based on the ratio of the two distances,
Wherein the representative point is calculated for use as a seed of a Voronoi cell.
Wherein the cluster allocation step allocates the data item to the cluster of the closest representative point when the ratio is less than or equal to a threshold value.
Wherein the cluster allocation step calculates a target representative point using the dynamical system when the ratio is greater than the threshold value, and allocates the data item to the cluster of the calculated representative point.
Wherein the cluster definition step calculates a support function from the sample data and calculates the representative points using the support function.
Wherein the cluster definition step comprises constructing a weighted graph using the representative points, deleting edges having weights less than a threshold value from the weighted graph to divide it into subgraphs, and defining the clusters such that the representative points of each divided subgraph form one cluster.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020140031027A KR101577249B1 (en) | 2014-03-17 | 2014-03-17 | Device and method for voronoi cell-based support clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020140031027A KR101577249B1 (en) | 2014-03-17 | 2014-03-17 | Device and method for voronoi cell-based support clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20150108173A KR20150108173A (en) | 2015-09-25 |
KR101577249B1 true KR101577249B1 (en) | 2015-12-14 |
Family
ID=54246287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020140031027A KR101577249B1 (en) | 2014-03-17 | 2014-03-17 | Device and method for voronoi cell-based support clustering |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101577249B1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152277B (en) * | 2023-03-10 | 2023-09-22 | 麦岩智能科技(北京)有限公司 | Map segmentation method and device, electronic equipment and medium |
2014-03-17: Application KR1020140031027A filed in KR; patent KR101577249B1 granted (active, IP Right Grant)
Non-Patent Citations (1)
Title |
---|
Kim Ho-Sook and Yong Hwan-Seung, "Design of a Weighted Graph-based Spatial Clustering Algorithm", Department of Computer Science, Ewha Womans University, 2002. |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20220060375A (en) * | 2020-11-04 | 2022-05-11 | 서울대학교산학협력단 | Method and apparatus for performing fair clustering through estimating fair distribution |
KR102542451B1 (en) * | 2020-11-04 | 2023-06-12 | 서울대학교산학협력단 | Method and apparatus for performing fair clustering through estimating fair distribution |
Also Published As
Publication number | Publication date |
---|---|
KR20150108173A (en) | 2015-09-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110443281B (en) | Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering | |
Fischer et al. | Bagging for path-based clustering | |
US8954365B2 (en) | Density estimation and/or manifold learning | |
CN113221065A (en) | Data density estimation and regression method, corresponding device, electronic device, and medium | |
Kim et al. | Improving discrimination ability of convolutional neural networks by hybrid learning | |
Richards et al. | Clustering and unsupervised classification | |
CN108564083A (en) | A kind of method for detecting change of remote sensing image and device | |
Reddy et al. | A Comparative Survey on K-Means and Hierarchical Clustering in E-Commerce Systems | |
KR101577249B1 (en) | Device and method for voronoi cell-based support clustering | |
Xu et al. | MSGCNN: Multi-scale graph convolutional neural network for point cloud segmentation | |
Zeybek | Inlier point preservation in outlier points removed from the ALS point cloud | |
Liu et al. | A new local density and relative distance based spectrum clustering | |
Ghoshal et al. | Estimating uncertainty in deep learning for reporting confidence: An application on cell type prediction in testes based on proteomics | |
Xu et al. | The image segmentation algorithm of colorimetric sensor array based on fuzzy C-means clustering | |
Zheliznyak et al. | Analysis of clustering algorithms | |
KR100895261B1 (en) | Inductive and Hierarchical clustering method using Equilibrium-based support vector | |
Yu et al. | Sparse reconstruction with spatial structures to automatically determine neighbors | |
Shi et al. | Fuzzy support tensor product adaptive image classification for the internet of things | |
Richards et al. | Clustering and Unsupervised Classification | |
Liu et al. | A novel local density hierarchical clustering algorithm based on reverse nearest neighbors | |
CN112800138A (en) | Big data classification method and system | |
CN113779287A (en) | Cross-domain multi-view target retrieval method and device based on multi-stage classifier network | |
Rao et al. | Common object discovery as local search for maximum weight cliques in a global object similarity graph | |
KR101133804B1 (en) | Fast kernel quantile clustering method for large-scale data | |
Chang et al. | Fast marching based superpixels generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment |
Payment date: 20181203 Year of fee payment: 4 |
|
FPAY | Annual fee payment |
Payment date: 20191203 Year of fee payment: 5 |