CN105824853B

CN105824853B - Clustering device and method

Info

Publication number: CN105824853B
Application number: CN201510011593.6A
Authority: CN
Inventors: 刘博�; 胡卫松; 刘晓炜; 唐建波; 刘启亮
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2015-01-09
Filing date: 2015-01-09
Publication date: 2020-06-26
Anticipated expiration: 2035-01-09
Also published as: CN105824853A

Abstract

There is provided a clustering device including: a spatial neighborhood selection unit configured to select a spatial neighborhood of each object in the spatial data set; a core point calculation unit configured to calculate a core point in the spatial data set, the core point having a similar attribute value to other objects within a spatial neighborhood of the core point; an extraction unit configured to extract a core point in the spatial data set and objects located within a spatial neighborhood of the core point, constituting a corresponding spatial data subset; and a merging unit configured to cluster the spatial data subsets. A clustering method is also provided. By adopting the method and the device, the significance of the spatial hierarchy clustering result can be effectively judged, and the obtained clustering result is more reliable.

Description

Clustering device and method

Technical Field

The present application relates to the field of data analysis, and in particular, to a clustering device and method.

Background

Clustering is a means of classifying things or partitioning them into natural units. Hierarchical clustering is currently the most widely used clustering method. It mainly adopts a bottom-up strategy of gradual merging: for example, people form families, families form communities, communities form cities, and cities form countries, so that units are merged layer by layer into larger groupings.

Geographers have proposed spatial hierarchical clustering, which adds spatial adjacency as a constraint to traditional hierarchical clustering and is used to cluster data that has both geospatial position and attribute information (such as temperature or monitoring data from environmental monitoring sites). Spatial hierarchical clustering can produce a series of results, but cannot directly indicate which clustering result(s) the user actually needs or is interested in. In practical applications, the user is required to set a clustering end condition (i.e., a condition for terminating merging, such as the number of classes or the maximum number of elements in each class). Users without prior or background knowledge find it difficult to set a proper termination condition, which limits the use of spatial hierarchical clustering. How to automatically determine an appropriate threshold (or termination condition) from the characteristics of the data itself remains an open challenge.

Spatial clustering algorithms that take thematic attributes into account can be roughly divided into partition-based, density-based, and hierarchy-based approaches. Most partition-based methods define a mixed distance by weighting the spatial and thematic attributes, and then cluster with a traditional partitioning method. This type of approach has two significant problems: first, the weights of the spatial and thematic attributes are difficult to define; second, the randomness of the initial cluster center selection results in poor cluster quality. The density-based approach requires a threshold to control the expansion of spatial clusters, similar to region-growing hierarchical clustering. The hierarchy-based approach is mainly based on traditional agglomerative hierarchical clustering, extended by adding a spatial constraint. During agglomerative merging, each spatial entity computes the similarity of thematic attributes only with its spatially neighboring entities.

In practical applications, determining the clustering end condition is the core unsolved problem of spatial hierarchical clustering analysis.

Disclosure of Invention

The invention provides a self-adaptive spatial hierarchical clustering technology which can effectively judge the end condition of spatial hierarchical clustering and partition out areas with similar attributes in spatial data. Meanwhile, the invention adopts a statistical test method to determine the natural aggregation structure in the data according to the characteristics of the data, thereby avoiding the problem of artificially setting the threshold.

According to an aspect of the present invention, there is provided a clustering apparatus including: a spatial neighborhood selection unit configured to select a spatial neighborhood of each object in the spatial data set; a core point calculation unit configured to calculate a core point in the spatial data set, the core point having a similar attribute value to other objects within a spatial neighborhood of the core point; an extraction unit configured to extract a core point in the spatial data set and objects located within a spatial neighborhood of the core point, constituting a corresponding spatial data subset; and a merging unit configured to cluster the spatial data subsets.

In one embodiment, objects in a spatial data set have a spatial location attribute and a thematic attribute.

In one embodiment, the spatial neighborhood selection unit is configured to select the spatial neighborhood of each object in the spatial data set according to at least one of the following neighborhood construction techniques: eps-neighborhoods, KNN neighborhoods, and Delaunay triangulation first order neighborhoods.

In one embodiment, the attribute values include at least one of: variance, standard deviation, variance to mean ratio, local spatial autocorrelation index.

In one embodiment, the core point calculation unit is configured to: calculate a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; perform multiple random rearrangements in the spatial data set; and judge whether each object is a core point based on the significance of the local variance of the object.

In one embodiment, the core point calculation unit is configured to: calculate a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; perform Bootstrap random sampling multiple times in the spatial data set; and judge whether each object is a core point based on the significance of the local variance of the object.

In one embodiment, the core point calculation unit is configured to: calculate a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; perform multiple random rearrangements in the spatial data set; calculate the local variance of each object and carry out chi-square curve fitting; and calculate the significance of the local variance of each object according to the chi-square curve so as to judge whether the object is a core point.

In one embodiment, the core point calculation unit is configured to: calculate the spatial distance and the attribute-value distance between each object in the spatial data set and other objects within the spatial neighborhood of the object; perform kernel density estimation; and search for local extreme points of the kernel density surface as core points.

In one embodiment, the core point calculation unit is configured to: calculate a local spatial autocorrelation index for each object in the spatial data set; perform a Z-test on the local spatial autocorrelation index; and take an object with a significant local spatial autocorrelation index as a core point.

In one embodiment, the merging unit is configured to: select the two adjacent objects or clusters that are most similar, and combine them into a new spatial cluster; judge whether the attribute values of the objects in the new spatial cluster are similar; and if they are similar, merge the two objects or clusters and continue selecting the two most similar adjacent objects or clusters for merging; otherwise, not merge the two selected objects or clusters.

In one embodiment, the merging unit is configured to determine whether the attribute values of the objects within the new spatial cluster are similar by: performing multiple random rearrangements in the new spatial cluster, wherein after each rearrangement, the core points in the new spatial cluster are calculated according to the significance test of the local variance; and if the core points are stable, the attribute values of the objects in the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

In one embodiment, the merging unit is configured to determine whether the attribute values of the objects within the new spatial cluster are similar by: judging whether the attribute values of the objects in the new spatial cluster follow a normal distribution; if they do, judging whether the attribute means of the two objects or clusters differ significantly; and if there is no significant difference, the attribute values of the objects within the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

According to an aspect of the present invention, there is provided a clustering method including: selecting a spatial neighborhood of each object in the spatial dataset; calculating a core point in the spatial data set, wherein the core point has similar attribute values with other objects in the spatial neighborhood of the core point; extracting core points in the spatial data set and objects positioned in the spatial neighborhood of the core points to form a corresponding spatial data subset; and clustering the spatial data subsets.

In one embodiment, the spatial neighborhood of each object in the spatial data set is selected according to at least one of the following neighborhood construction techniques: eps-neighborhoods, KNN neighborhoods, and Delaunay triangulation first order neighborhoods.

In one embodiment, computing the core points in the spatial data set comprises: calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; performing multiple random rearrangements in the spatial data set; and judging whether each object is a core point or not based on the significance of the local variance of the object.

In one embodiment, computing the core points in the spatial data set comprises: calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; performing Bootstrap random sampling for multiple times in a spatial data set; and judging whether each object is a core point or not based on the significance of the local variance of the object.

In one embodiment, computing the core points in the spatial data set comprises: calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; performing multiple random rearrangements in the spatial data set; calculating the local variance of each object and carrying out chi-square curve fitting; and calculating the significance of the local variance of each object according to the chi-square curve so as to judge whether the object is a core point.

In one embodiment, computing the core points in the spatial data set comprises: calculating the spatial distance and the attribute-value distance between each object in the spatial data set and other objects within the spatial neighborhood of the object; performing kernel density estimation; and searching for local extreme points of the kernel density surface as core points.

In one embodiment, computing the core points in the spatial data set comprises: calculating a local spatial autocorrelation index for each object in the spatial data set; performing a Z-test on the local spatial autocorrelation index; and taking an object with a significant local spatial autocorrelation index as a core point.

In one embodiment, clustering the subsets of spatial data comprises: selecting two adjacent most similar objects or clusters, and combining the two objects or clusters into a new space cluster; judging whether the attribute values of the objects in the new space cluster are similar or not; and if the two objects or clusters are similar, merging the two objects or clusters and continuously selecting two adjacent most similar objects or clusters for merging, otherwise, not merging the two selected objects or clusters.

In one embodiment, whether the attribute values of the objects within the new spatial cluster are similar is determined by: performing multiple random rearrangements in the new spatial cluster, wherein after each rearrangement, a core point in the new spatial cluster is calculated according to the significance judgment of the local variance; and if the core point is stable, the attribute values of the objects in the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

In one embodiment, whether the attribute values of the objects within the new spatial cluster are similar is determined by: judging whether the attribute values of the objects in the new spatial cluster follow a normal distribution; if they do, judging whether the attribute means of the two objects or clusters differ significantly; and if there is no significant difference, the attribute values of the objects within the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

The invention determines a proper clustering end condition according to the characteristics of the data, so as to obtain the natural aggregation structure in the data. The significance of the spatial hierarchical clustering result can therefore be judged effectively, and the clustering result obtained is more reliable.

Drawings

The above and other features of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a block diagram illustrating a clustering apparatus according to an embodiment of the present invention.

Fig. 2 shows a schematic diagram of eps-neighborhoods, KNN neighborhoods (K = 4), and Delaunay triangulation first order neighborhoods.

FIG. 3 illustrates a clustering process according to one embodiment of the invention.

Fig. 4 shows a spatial data set according to an embodiment of the invention.

FIG. 5 shows a schematic diagram of a spatial neighborhood according to an embodiment of the invention.

FIG. 6 shows a schematic diagram of local variance according to one embodiment of the invention.

Fig. 7 shows a schematic diagram of a result after random reordering according to one embodiment of the invention.

Fig. 8 shows a schematic diagram of local variance values according to an embodiment of the invention.

FIG. 9 shows a schematic diagram of a core point, according to one embodiment of the invention.

FIG. 10 is a diagram illustrating a core point and a cluster generated after merging hierarchical clusters according to an embodiment of the present invention.

FIG. 11 illustrates a schematic diagram of randomly reordering attribute values within a merged cluster according to one embodiment of the present invention.

FIG. 12 shows a diagram of final clustering results according to one embodiment of the invention.

Fig. 13 is a flowchart illustrating a clustering method according to another embodiment of the present invention.

Detailed Description

The principles and operation of the present invention will become apparent from the following description of specific embodiments thereof, taken in conjunction with the accompanying drawings. It should be noted that the present invention should not be limited to the specific embodiments described below. In addition, a detailed description of known technologies that are not related to the present invention is omitted for the sake of brevity.

Fig. 1 is a block diagram illustrating a clustering apparatus according to an embodiment of the present invention. As shown in Fig. 1, the clustering device 10 includes a spatial neighborhood selection unit 110, a core point calculation unit 120, an extraction unit 130, and a merging unit 140.

The spatial neighborhood selection unit 110 is configured to select a spatial neighborhood for each object in the spatial data set. In this embodiment, the spatial data set is a set of spatial objects that need to be spatially clustered. A spatial object mainly has spatial position attributes and thematic attributes, such as the geographic coordinates of environmental monitoring sites and the air temperature, precipitation, PM2.5, and other data recorded at those sites.

In this embodiment, the spatial neighborhood of each object is selected according to a certain spatial neighborhood construction technique.

For example, the spatial neighborhood selection unit 110 may construct a spatial neighborhood for each object in the spatial data set using eps-neighborhoods, KNN neighborhoods, neighborhoods based on edge or point adjacency (4-neighborhoods, 8-neighborhoods, etc.), graph-theoretic neighborhoods (e.g., Delaunay triangulation first order neighborhoods), or other techniques. Fig. 2 shows a schematic diagram of eps-neighborhoods, KNN neighborhoods (K = 4), and Delaunay triangulation first order neighborhoods.
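The eps-neighborhood and KNN neighborhood mentioned above can be sketched as follows. This is only an illustrative sketch: the function names, the use of planar Euclidean distance, and the choice to include the object itself in its KNN neighborhood are assumptions, not details taken from the patent.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all objects within distance eps of object i (i itself included)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if math.hypot(x - xi, y - yi) <= eps]

def knn_neighborhood(points, i, k):
    """Object i followed by its k nearest other objects (ties broken by index).

    Including i itself is an assumption made for consistency with the
    eps-neighborhood above.
    """
    xi, yi = points[i]
    others = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: (math.hypot(points[j][0] - xi, points[j][1] - yi), j),
    )
    return [i] + others[:k]

# Hypothetical coordinates of five monitoring sites.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (1.0, 1.0)]
eps_nn = eps_neighborhood(points, 0, eps=1.2)
knn_nn = knn_neighborhood(points, 0, k=2)
```

A Delaunay first order neighborhood would instead take the objects that share a triangle edge with the object, which requires a triangulation routine and is omitted here.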

The core point calculation unit 120 is configured to compute a core point in the spatial data set, the core point having attribute values similar to those of other objects within its spatial neighborhood.

In the present embodiment, the attribute value may include a variance, a standard deviation, a ratio of the variance to the mean, or a local spatial autocorrelation index (the local Moran's I index or the G(d) index). For example, the following attribute values may be employed:

local variance:

local Moran's I:

G(d)：

where Z denotes the attribute value of an object, NN(P_i) denotes the spatial neighborhood of object P_i, and E denotes the mean of the attribute values within the object's spatial neighborhood.

A core point is a spatial object whose attribute value is similar to those of the other objects in its spatial neighborhood. Detecting core points plays an important role in generating spatial clusters, because core points are the basic units that make up spatial clusters. In particular, core points may be computed using local spatial autocorrelation indexes, statistical tests, or kernel density estimation. The basic idea of the local spatial autocorrelation index and statistical test methods is as follows: under the null hypothesis (the distribution of spatial object attribute values is independent of spatial position, that is, the attribute values are distributed completely at random in space), the observed distribution of attribute values in the whole data set is regarded as one realization of the null hypothesis, and a statistical test is used to judge the probability of that distribution occurring (that is, whether it is a small-probability event). If it is a small-probability event, the attribute values of the spatial object and its surrounding neighboring objects are similar.

In the following, several specific examples of calculating the core point are described.

Example 1

In this example, the core point calculation unit 120 first computes the local variance of each object in the spatial data set with other objects within the spatial neighborhood of the object. Then, the core point calculation unit 120 performs a plurality of random rearrangements in the spatial data set. Finally, the core point calculation unit 120 determines whether each object is a core point based on the significance of the local variance of the object. The calculation process of Example 1 is described in detail below.

By adopting the method, the core point is searched according to the self characteristics of the data without the need of subjective parameter selection of the user when the core point is calculated, so that the method has stronger self-adaptive capability and can generate more stable results.

Example 2

In this example, the core point calculation unit 120 first computes the local variance of each object in the spatial data set with other objects within the spatial neighborhood of the object.

Then, the core point calculation unit 120 performs Bootstrap random sampling multiple times in the spatial data set. Bootstrap random sampling treats the attribute values of all objects as a set and, for each object P_i, randomly draws n_i values from that set (n_i is the number of objects in the spatial neighborhood of P_i) and then computes the local variance of P_i once from the drawn values. This random draw is repeated multiple times, and each draw is made from the original attribute value set; that is, it is sampling with replacement. For example, if the set of attribute values is {1, 2, 3, 4} and there are 3 objects in the spatial neighborhood of P_i, the results of two Bootstrap random draws may be {1, 3, 3} and {2, 4, 2}.
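The Bootstrap draw described above can be illustrated with a short sketch. The function names are hypothetical, and the variance is computed in its population form (dividing by the number of drawn values), which is an assumption because the patent's formula is not reproduced in this text.

```python
import random

def local_variance(values):
    """Population variance of the given values (divisor is an assumption)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def bootstrap_local_variances(all_values, n_i, m, seed=0):
    """Draw n_i values with replacement m times and return the m local variances.

    Every draw is taken from the full, original attribute value set.
    """
    rng = random.Random(seed)
    return [local_variance(rng.choices(all_values, k=n_i)) for _ in range(m)]

# Attribute set {1, 2, 3, 4}; the object has 3 objects in its neighborhood.
samples = bootstrap_local_variances([1, 2, 3, 4], n_i=3, m=1000)
```

The observed local variance of P_i can then be compared against `samples` to judge its significance, for instance by the share of sampled variances that do not exceed it.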

Finally, the core point calculation unit 120 determines whether each object is a core point based on the significance of the local variance of the object.

Example 3

The inventors of the present invention have found that the local variance of the core points approximately obeys a chi-squared distribution. Therefore, the data can be randomly rearranged multiple times and the local variance calculated after each rearrangement; a chi-square distribution curve is then fitted, that is, the parameter k of the chi-square distribution is estimated from the local variance values obtained over the multiple rearrangements, and the p-value of the local variance is then calculated from the chi-square distribution density function and the local variance.

Specifically, the core point calculation unit 120 calculates the local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object, and then performs a plurality of random rearrangements in the spatial data set. Thereafter, the core point calculation unit 120 calculates the local variance of each object and performs chi-square curve fitting, and calculates the significance of the local variance of each object from the chi-square curve, thereby determining whether the object is a core point.

In this example, the significance is calculated by fitting an approximate probability density curve, reducing the amount of calculation.

Example 4

In the present example, the core point calculation unit 120 computes the spatial distance and the attribute-value distance between each object in the spatial data set and other objects within the spatial neighborhood of the object. Then, the core point calculation unit 120 performs kernel density estimation and searches for local extreme points of the kernel density surface as core points.

For example, the spatial distance and attribute distance between objects may be first defined by an inter-object similarity metric, as follows:

spatial distance:

attribute distance:

where (x_1, y_1) and (x_2, y_2) are the spatial coordinates of objects P1 and P2, respectively, and the remaining symbols denote the attribute values of objects P1 and P2, respectively. A kernel density estimation method is then used to search for local maximum points of the kernel density surface, which are taken as core points. Kernel density estimation is a non-parametric method of estimating a probability density using a kernel function, which may be, for example, a Gaussian kernel function. A location with high kernel density indicates that the objects in that area are close to one another in both space and attribute value, and thus indirectly reflects the degree of homogeneity of the object attributes: the higher the degree of homogeneity, the higher the kernel density. A local maximum point represents a locally homogeneous area and is therefore taken as a core point.

Example 5

The local spatial autocorrelation index measures how strongly an attribute at each spatial location within the study range is correlated with the same attribute at its surrounding neighboring locations. In this example, a local spatial autocorrelation index is employed to compute the core points. Specifically, the core point calculation unit 120 calculates a local spatial autocorrelation index for each object in the spatial data set. Then, the core point calculation unit 120 performs a Z-test on the local spatial autocorrelation index, and takes an object whose local spatial autocorrelation index is significant as a core point.

For example, the core point calculation unit 120 may first calculate a local spatial autocorrelation index (e.g., the local Moran's I or G(d) mentioned above). Assuming that the local Moran's I index approximately follows a normal distribution, a normal distribution statistic can be constructed:

$$Z(I_i) = \frac{I_i - E(I_i)}{\sqrt{\mathrm{Var}(I_i)}}$$

where I_i denotes the local autocorrelation coefficient of object i, and E(I_i) and Var(I_i) denote its expectation and variance. The Z statistic is then tested. If Z is significant, the area where the object is located is considered to be homogeneous, and the locally significant object is taken as a core point.
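A minimal sketch of the Z-test step, given an already computed local index together with its expectation and variance. The function names and the default significance level are illustrative, and the one-sided upper-tail test (testing for positive autocorrelation) is an assumption, since the text does not state whether the test is one-sided or two-sided.

```python
import math
from statistics import NormalDist

def z_statistic(i_local, expected, variance):
    """Standardize a local autocorrelation index: (I - E(I)) / sqrt(Var(I))."""
    return (i_local - expected) / math.sqrt(variance)

def is_significant(i_local, expected, variance, alpha=0.05):
    """Upper-tail Z-test under the normal approximation (one-sided by assumption)."""
    z = z_statistic(i_local, expected, variance)
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value <= alpha

# Hypothetical index values with expectation 0 and variance 1.
strong = is_significant(2.0, 0.0, 1.0)
weak = is_significant(1.0, 0.0, 1.0)
```

Objects for which the test succeeds would be taken as core points in this example.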

Returning to fig. 1, the extraction unit 130 is configured to extract the core points in the spatial data set and the objects located within the spatial neighborhood of the core points, constituting the respective spatial data subsets. Since the core points are small regions with uniform spatial object properties, the uniform regions in the data set are all composed of these core points and objects in their spatial neighborhood. The properties of the other regions are randomly distributed or may be referred to as background or noise.

The merging unit 140 is configured to cluster the spatial data subsets. The merging unit 140 may select two adjacent most similar objects or clusters to merge into a new spatial cluster. Then, it is determined whether the attribute values of the objects within the new spatial cluster are similar. If so, merging the two objects or clusters and continuously selecting two adjacent objects or clusters which are most similar for merging, otherwise, not merging the two selected objects or clusters.

When determining whether the attribute values of the objects in the new spatial cluster are similar, the merging unit 140 may perform multiple random rearrangements in the new spatial cluster. After each rearrangement, the core points within the new spatial cluster are computed based on the significance test of the local variance. If the core points are stable, the attribute values of the objects in the new spatial cluster are considered similar; otherwise, the attribute values of the objects within the new spatial cluster are considered dissimilar. This approach is highly adaptive because it places no restriction on the distribution of the data: the homogeneity of the clusters is determined entirely by the characteristics of the data itself.

Alternatively, the merging unit 140 may determine whether the attribute values of the objects within the new spatial cluster follow a normal distribution. If they do, it judges whether the attribute means of the two objects or clusters differ significantly. If there is no significant difference, the attribute values of the objects in the new spatial cluster are considered similar; otherwise, the attribute values of the objects within the new spatial cluster are considered dissimilar. In this approach, the attribute values of the objects within the cluster need not be randomly rearranged, but the assumption that they approximately follow a normal distribution must hold. Therefore, it is first necessary to check whether the attribute values of the objects in the cluster satisfy the normal distribution assumption, for example with a Q-Q plot.

After merging by the merging unit 140, a series of homogeneous clusters is finally obtained, no two of which can be merged into a larger homogeneous cluster because their attributes differ too much.

Next, a specific operation example of the clustering device 10 is given. In this example, with the variance as the measure of attribute similarity, the core point calculation unit 120 calculates the core points as in Example 1, and the merging unit 140 judges intra-cluster homogeneity using multiple random rearrangements. The specific operation is as follows:

First, the spatial neighborhood selection unit 110 constructs a spatial neighborhood for each object in the spatial data set. For example, for the spatial data set SD = {P_1, P_2, ..., P_n}, the spatial neighborhood selection unit 110 uses the eps-neighborhood to determine a set of neighboring spatial objects for each spatial object. The eps-neighborhood is constructed as follows: given a distance threshold eps, the set of all spatial objects within the circle centered on the spatial object P_i with radius eps is called the spatial neighborhood NN(P_i) of P_i. For example, as shown in Fig. 2(a), NN(P_i) = {P_i, P_1, P_2, P_3, P_4}.

The core point calculation unit 120 uses the variance to measure the similarity between a spatial object P_i and the other objects in its spatial neighborhood NN(P_i), denoted LV(P_i) (because this variance is computed over the attribute values in the object's spatial neighborhood, it is hereinafter called the local variance). The specific expression is as follows:

In the above formula, n_i is the number of objects in NN(P_i), and Z_{P_k} denotes the thematic attribute value of spatial object P_k.
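As an illustration of the local variance, the sketch below computes LV(P_i) for each object from its neighborhood index list. The data are made up, and the population form of the variance (dividing by n_i) is an assumption: the expression itself is not reproduced in this text, so the exact divisor used in the patent may differ.

```python
def local_variance(values, neighborhood):
    """Variance of the thematic attribute over the objects in one neighborhood.

    `neighborhood` lists the indices in NN(P_i), including P_i itself.
    Dividing by n_i (population variance) is an assumption.
    """
    zs = [values[k] for k in neighborhood]
    mean = sum(zs) / len(zs)
    return sum((z - mean) ** 2 for z in zs) / len(zs)

# Hypothetical attribute values and neighborhoods for five objects.
values = [10.0, 10.5, 9.5, 30.0, 10.0]
neighborhoods = [[0, 1, 2], [1, 0, 4], [2, 0, 4], [3, 4], [4, 1, 2, 3]]
lv = [local_variance(values, nn) for nn in neighborhoods]
```

A small local variance, such as that of object 0 here, indicates that the object resembles its neighbors, while object 3 with its outlying value has a large one.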

Then, the core point calculation unit 120 randomly rearranges all the data and calculates the local variance of each object after rearrangement. In other words, to determine whether the thematic attribute of each spatial object is similar to that of its spatially neighboring objects, the core point calculation unit 120 performs a significance test by random rearrangement. The specific operation is as follows:

(1) Given a spatial data set SD = {P_1, P_2, ..., P_n}, calculate the local variance of each object P_k (P_k ∈ SD) to form a local variance vector LV(P_k).

(2) Randomly rearrange the thematic attributes of the spatial data set SD and calculate the local variance of each object to form a local variance vector, denoted LV_k, k ≥ 1.

(3) Repeating operation (2) m times yields a randomly rearranged local variance matrix W = [LV_1, LV_2, ..., LV_m], from which the significance (p-value) of the local variance of each spatial object can be calculated:

In the above formula, W_ik denotes the element in the i-th row and k-th column of the matrix W, and I(·) denotes the indicator function.
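Steps (1) to (3) can be sketched as a permutation test. The p-value expression is not reproduced in this text, so the form used here, (number of rearrangements whose local variance does not exceed the observed one + 1) / (m + 1), is an assumption chosen to be consistent with the indicator function and with the smallest attainable level of about 1/(m + 1) mentioned later; the function names and the population variance are likewise assumptions.

```python
import random

def local_variances(values, neighborhoods):
    """Local variance of every object (population form, by assumption)."""
    out = []
    for nn in neighborhoods:
        zs = [values[k] for k in nn]
        mean = sum(zs) / len(zs)
        out.append(sum((z - mean) ** 2 for z in zs) / len(zs))
    return out

def permutation_p_values(values, neighborhoods, m=999, seed=0):
    """Estimate, for each object, how unusual its small local variance is."""
    rng = random.Random(seed)
    observed = local_variances(values, neighborhoods)       # step (1)
    counts = [0] * len(values)
    shuffled = list(values)
    for _ in range(m):                                      # step (3)
        rng.shuffle(shuffled)                               # step (2)
        permuted = local_variances(shuffled, neighborhoods)  # one column of W
        for i, (w_ik, lv_i) in enumerate(zip(permuted, observed)):
            if w_ik <= lv_i:                                # indicator I(.)
                counts[i] += 1
    return [(c + 1) / (m + 1) for c in counts]

# Hypothetical chain of six objects: a homogeneous run followed by mixed values.
values = [10.0, 10.1, 9.9, 25.0, 3.0, 17.0]
neighborhoods = [[0, 1], [1, 0, 2], [2, 1, 3], [3, 2, 4], [4, 3, 5], [5, 4]]
p_values = permutation_p_values(values, neighborhoods, m=199)
```

Only the attribute values are shuffled; the spatial positions, and therefore the neighborhoods, stay fixed across rearrangements.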

Next, the core point calculation unit 120 determines the core points from the rearranged local variance matrix. Under a given significance level α, if p-value(P_i) ≤ α, P_i is called a core point. In addition, in order to detect the core points more accurately, the following alternative operation can be performed. Since the test for core points is a multiple hypothesis testing problem, the significance level α can be further corrected by using the B-H method. The specific steps are as follows:

First, the values p-value(P_i) (i = 1, 2, ..., n) are sorted from small to large and denoted p-value(P_(i)) (i = 1, 2, ..., n). Detection starts from p-value(P_(1)) and proceeds step by step. If there exists k = max{i : p-value(P_(i)) ≤ iα/n, 1 ≤ i ≤ n}, then the spatial objects P_(i) corresponding to p-value(P_(i)) (i = 1, 2, ..., k) are core points. When the amount of data is large, the significance level can be set to α ≈ 1/(m + 1) in order to improve efficiency (where m is the number of random rearrangements, such as 9999). If no core point is detected, the attribute is randomly distributed in space and no attribute aggregation region exists, so the algorithm stops. If core points exist, the core points and the objects adjacent to them are extracted and subjected to spatially constrained hierarchical clustering.
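The B-H correction described above can be sketched as follows, assuming the standard Benjamini-Hochberg step-up rule (the largest rank i with p-value(P_(i)) ≤ iα/n). The function name is hypothetical, and the input is simply a list of p-values indexed by object.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the objects declared core points by the step-up rule."""
    n = len(p_values)
    # Object indices ordered by ascending p-value: P_(1), P_(2), ..., P_(n)
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0  # largest 1-based rank i with p_(i) <= i * alpha / n
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / n:
            k = rank
    # Every object ranked at or before k is accepted, even if its own
    # p-value exceeded the threshold for its rank
    return sorted(order[:k])
```

An empty result corresponds to the case in which no core point is detected and the algorithm stops.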

The extraction unit 130 extracts the core points in the spatial data set and the objects located within the spatial neighborhood of the core points, constituting the corresponding spatial data subsets.

The merging unit 140 clusters the spatial data subsets extracted by the extraction unit 130. Specifically, the merging unit 140 finds the two spatially adjacent objects in the spatial data set that are most similar to each other and merges them. For example, the core points with the smallest local variance may be merged by Ward hierarchical clustering to generate clusters. In the merging process, each spatial object calculates the similarity of thematic attributes only with the objects that are spatially adjacent to it. FIG. 3 illustrates a clustering process according to one embodiment of the invention. In FIG. 3, Li-j denotes the j-th cluster of the i-th layer. First, each object is regarded as a cluster, and the similarity between each cluster and its spatially adjacent clusters is calculated. The two most similar clusters are then merged, and the adjacency relation of each spatial cluster is updated. Repeating these steps yields clustering results at a plurality of layers.
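The bottom-up merging restricted to spatially adjacent clusters can be sketched as follows. This is an illustrative sketch under several assumptions: a single numeric thematic attribute, Ward's criterion (increase in within-cluster sum of squares) as the dissimilarity, and a hypothetical `accept` callback standing in for whatever test decides whether a proposed merge is kept. None of these names come from the source.

```python
def ward_cost(a_vals, b_vals):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    na, nb = len(a_vals), len(b_vals)
    ma, mb = sum(a_vals) / na, sum(b_vals) / nb
    return na * nb / (na + nb) * (ma - mb) ** 2


def constrained_merge(values, neighborhoods, accept=lambda members: True):
    """Greedy merging of the most similar pair of spatially adjacent clusters.

    values        -- thematic attribute value of each object
    neighborhoods -- neighborhoods[i] lists the indices spatially adjacent to i
    accept        -- hypothetical hook: returns True if the merged cluster is kept
    Returns the final clusters as sorted lists of object indices.
    """
    clusters = {i: [i] for i in range(len(values))}  # each object starts alone
    label = list(range(len(values)))                 # object index -> cluster id
    rejected = set()                                 # pairs refused by accept()

    def cost(pair):
        a, b = pair
        return ward_cost([values[k] for k in clusters[a]],
                         [values[k] for k in clusters[b]])

    while True:
        # Candidate pairs: clusters that contain at least one adjacent object pair
        pairs = set()
        for i, nbrs in enumerate(neighborhoods):
            for j in nbrs:
                a, b = label[i], label[j]
                if a != b:
                    pairs.add((min(a, b), max(a, b)))
        pairs -= rejected
        if not pairs:
            break
        a, b = min(pairs, key=lambda p: (cost(p), p))
        merged = clusters[a] + clusters[b]
        if accept(merged):
            clusters[a] = merged
            del clusters[b]
            for k in merged:
                label[k] = a
            # Cluster a changed, so earlier refusals involving a or b are stale
            rejected = {p for p in rejected if a not in p and b not in p}
        else:
            rejected.add((a, b))
    return sorted(sorted(c) for c in clusters.values())
```

Recomputing the candidate pairs on every pass keeps the sketch short at the expense of speed; an implementation for large data sets would maintain the cluster adjacency incrementally.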

In the spatial hierarchical clustering process, the merging unit 140 performs a stability check on the core points of the spatial cluster obtained by each merge. That is, after the thematic attributes of the objects in the cluster are randomly rearranged, the merging unit 140 determines whether the core points in the cluster are still similar in attribute to their spatially neighboring objects. If the core points in the cluster are still similar in attribute to the adjacent objects after a large number of random rearrangements within the cluster, the merged spatial cluster can be considered a homogeneous cluster. The specific operation is as follows:

(1) Assume that cluster A and cluster B are merged into a new spatial cluster S, and calculate the local variance of each core point of S.

(2) Keeping the thematic attribute value of each core point unchanged, randomly rearrange the thematic attributes of the other objects in the spatial cluster S, and calculate the p-value(P_i) (i = 1, ..., g) of each core point after rearrangement. To reduce the amount of calculation, the local variance matrix W generated in the global random rearrangement process can be reused to calculate the p-value of each core point P_i according to the formula. At a given significance level α, determine whether all of the core points within the rearranged spatial cluster S are still core points, and mark the result with I_k: if the total number of core points in the cluster is unchanged, I_k = 1; otherwise, I_k = 0.

(3) Repeatedly performing operation (2) n times, the significance of the spatial cluster S can be calculated by the following formula:

Under the given cluster significance level β, if p-value(S) ≤ β, the spatial cluster S is a homogeneous cluster. If the spatial cluster S is a homogeneous cluster, the merging unit 140 merges clusters A and B, updates the spatial adjacency relation of all current spatial clusters, and performs the next merge, until no further merge can produce a homogeneous cluster.
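The formula for p-value(S) is missing from this text, so the following is only a sketch under a loudly stated assumption: the p-value is taken as the share of rearrangements in which the core points did not remain stable (I_k = 0), with the same plus-one adjustment commonly used for permutation tests. This choice makes a stable cluster yield a small p-value, in line with the decision rule p-value(S) ≤ β, but the exact estimator and both function names are hypothetical.

```python
def cluster_p_value(indicators):
    """Assumed estimator for p-value(S).

    indicators -- list of I_k values, one per rearrangement
                  (1 if the core-point count was unchanged, 0 otherwise)
    """
    n = len(indicators)
    unstable = sum(1 for i_k in indicators if i_k == 0)
    # Assumption: (number of unstable rearrangements + 1) / (n + 1)
    return (unstable + 1) / (n + 1)


def is_homogeneous(indicators, beta=0.05):
    """Decision rule from the text: homogeneous if p-value(S) <= beta."""
    return cluster_p_value(indicators) <= beta
```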

Hereinafter, a specific application example of the clustering device 10 shown in fig. 1 is described in detail with reference to fig. 4 to 12.

FIG. 4 shows a spatial data set according to an embodiment of the invention. As shown in FIG. 4, the data set is a data matrix of size 8 × 10. In this example, the spatial neighborhood selection unit 110 uses an 8-neighborhood to construct the spatial neighborhood of each object in the spatial data set. For example, for the object in row 4, column 3 of FIG. 4, the spatial neighborhood constructed by the spatial neighborhood selection unit 110 is shown in FIG. 5. The core point calculation unit 120 then calculates the local variance. For example, for the spatial neighborhood shown in FIG. 5, the mean E of the attribute values used in the local variance calculation is as follows:

E＝(2+1+2+2+3+2+1+2+3)/9＝2
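The arithmetic of this example can be checked with a short sketch. The nine values are those summed above; the division by the number of values when forming the variance is an assumption, since the variance expression itself is not reproduced in this text.

```python
# The nine attribute values summed in the example (object plus its 8 neighbors)
neighborhood = [2, 1, 2, 2, 3, 2, 1, 2, 3]

mean = sum(neighborhood) / len(neighborhood)

# Population variance (divide by the number of values); the divisor is an assumption
variance = sum((v - mean) ** 2 for v in neighborhood) / len(neighborhood)

print(mean, round(variance, 3))  # prints: 2.0 0.444
```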

The core point calculation unit 120 calculates, for every object, the local variance of that object and the other objects in its spatial neighborhood. The results are shown in FIG. 6.

Next, the core point calculation unit 120 randomly rearranges the original data set to generate simulated data sets SD(k) (k = 1, 2, 3, ..., m). For simplicity, only the result of one random rearrangement is described, as shown in FIG. 7.

The core point calculation unit 120 calculates the local variance of each object in the rearranged simulated data. For example, the mean E of a rearranged neighborhood is:

E＝(54+35+81+3+3+76+2+5+44)/9＝33.67

Repeating the rearrangement m = 9999 times, each object obtains m local variance values, as shown in FIG. 8.

The core point calculation unit 120 determines the significance of the local variance of each object according to the following equation:

LV(P_i) = {sd₀, sd₁, sd₂, ..., sd_m};

where I(·) is the indicator function, that is, its value is 1 when the condition is true and 0 otherwise.

Given a significance level α = 0.05, if the p-value(P_i) of an object P_i is less than the given significance level α, the local variance is significant, and P_i is referred to as a core point. FIG. 9 shows a schematic diagram of the core points calculated by the core point calculation unit 120.

If there are no core points in the data set, the clustering process stops. Otherwise, the extraction unit 130 extracts a new data subset comprising the core points and the points within the spatial neighborhoods of the core points (as shown in the left half of FIG. 10). Then, the merging unit 140 performs spatially constrained hierarchical clustering, and each spatial cluster newly generated by merging during the hierarchical clustering is subjected to a statistical test. Assume that two clusters C1 and C2 have been generated after multiple hierarchical merges, as shown in the right half of FIG. 10. The next merge looks for the two spatially adjacent clusters with the most similar attributes, namely cluster C1 and cluster C2, and tests whether the merged cluster would be a significant homogeneous cluster. Assuming that C1 and C2 are merged into one larger cluster C12, the attribute values within the merged cluster are randomly rearranged, as shown in FIG. 11.

Then, the merging unit 140 calculates the local variance of the core points after rearrangement inside C12 and determines the significance of the local variance. If the local variance of the core points of the cluster is still significant after rearrangement, a count is made, i.e., I_k = 1; otherwise, I_k = 0. The above rearrangement and local variance calculation are repeated m times to calculate the significance of the cluster. At a given significance level, it is judged whether cluster C12 satisfies the homogeneous cluster condition. If the condition is satisfied, C1 and C2 are merged; otherwise, they are not merged. The next round of spatial hierarchical clustering and significance testing is then performed, and the calculation stops when no homogeneous cluster can be generated by merging. The final clustering result is shown in FIG. 12 (two clusters, C1 and C2).

FIG. 13 is a flow diagram illustrating a clustering method according to one embodiment of the invention. As shown in fig. 13, method 1300 begins at step S1310.

In step S1320, a spatial neighborhood of each object in the spatial data set is selected. Objects in a spatial data set may have a spatial location attribute and a thematic attribute. Preferably, the spatial neighborhood of each object in the spatial data set may be selected according to an eps-neighborhood, a KNN neighborhood, or a Delaunay triangulation first-order neighborhood.
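Two of the neighborhood constructions named in this step can be sketched as follows. This is a brute-force illustration that assumes Euclidean distance between coordinate tuples; the function names are hypothetical, and the Delaunay first-order neighborhood is omitted because it requires a triangulation routine.

```python
import math


def eps_neighborhood(points, i, eps):
    """Indices of all other points within distance eps of points[i]."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]


def knn_neighborhood(points, i, k):
    """Indices of the k points nearest to points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return others[:k]
```

The eps-neighborhood can be empty for an isolated object, whereas the KNN neighborhood always has k members when at least k other objects exist, so the choice affects how many values enter each local statistic.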

At step S1330, a core point in the spatial data set is calculated, the core point having attribute values similar to those of the other objects within its spatial neighborhood. Preferably, the attribute values may include at least one of: variance, standard deviation, variance-to-mean ratio, and local spatial autocorrelation index.

Preferably, calculating the core points in the spatial data set may comprise: calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; performing multiple random rearrangements in the spatial data set; and judging whether each object is a core point based on the significance of the local variance of the object.

Alternatively, calculating the core points in the spatial data set may comprise: calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; performing Bootstrap random sampling multiple times in the spatial data set; and judging whether each object is a core point based on the significance of the local variance of the object.

Alternatively, calculating the core points in the spatial data set may comprise: calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object; performing multiple random rearrangements in the spatial data set; calculating the local variance of each object and performing chi-square curve fitting; and calculating the significance of the local variance of each object according to the chi-square curve so as to judge whether the object is a core point.

Alternatively, calculating the core points in the spatial data set may comprise: calculating a spatial distance of each object in the spatial data set from the attribute values of other objects within a spatial neighborhood of the object; performing kernel density estimation; and searching for local extreme points of the kernel density surface as core points.

Alternatively, calculating the core points in the spatial data set may comprise: calculating a local spatial autocorrelation index for each object in the spatial data set; performing a Z-test on the local spatial autocorrelation index; and taking an object with a significant local spatial autocorrelation index as a core point.

In step S1340, the core points in the spatial data set and the objects located in the spatial neighborhood of the core points are extracted to form corresponding spatial data subsets.

In step S1350, the spatial data subsets are clustered. For example, two objects or clusters that are most similar to each other may be selected and merged into a new spatial cluster. Then, it is determined whether the attribute values of the objects within the new spatial cluster are similar. If so, merging the two objects or clusters and continuously selecting two adjacent objects or clusters which are most similar for merging, otherwise, not merging the two selected objects or clusters.

Preferably, whether the attribute values of the objects in the new spatial cluster are similar is determined by: performing multiple random rearrangements in the new spatial cluster, wherein after each rearrangement, a core point in the new spatial cluster is calculated according to the significance judgment of the local variance; and if the core point is stable, the attribute values of the objects in the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

Alternatively, whether the attribute values of the objects within the new spatial cluster are similar is determined by: judging whether the attribute values of the objects in the new spatial cluster follow a normal distribution; if they follow a normal distribution, judging whether the attribute means of the two objects or clusters differ significantly; and if there is no significant difference, the attribute values of the objects within the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

Finally, the method 1300 ends at step S1360.

By adopting the method and the device, the significance of a spatial hierarchical clustering result can be effectively judged, and the obtained clustering result is more reliable.

It should be understood that the above-described embodiments of the present invention can be implemented by software, hardware, or a combination of both software and hardware. For example, various components within the systems in the above embodiments may be implemented by a variety of devices, including but not limited to: analog circuits, digital circuits, general purpose processors, digital signal processing (DSP) circuits, programmable processors, application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), complex programmable logic devices (CPLD), and the like.

In addition, those skilled in the art will appreciate that the initial parameters described in the embodiments of the present invention may be stored in a local database, may be stored in a distributed database, or may be stored in a remote database.

Furthermore, embodiments of the invention disclosed herein may be implemented in a computer program product. More specifically, the computer program product is a product having a computer readable medium with computer program logic encoded thereon that, when executed on a computing device, provides the related operations for implementing the above-described aspects of the present invention. When executed on at least one processor of a computing system, the computer program logic causes the processor to perform the operations (methods) described in embodiments of the present invention. Such arrangements of the invention are typically provided as software, code and/or other data structures arranged or encoded on a computer readable medium such as an optical medium (e.g., CD-ROM), floppy disk or hard disk, or as firmware or microcode on one or more ROM, RAM or PROM chips, or as downloadable software images, shared databases, etc. in one or more modules. The software or firmware or such configurations may be installed on a computing device to cause one or more processors in the computing device to perform the techniques described in embodiments of the present invention.

Although the present invention has been described in conjunction with the preferred embodiments thereof, it will be understood by those skilled in the art that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention. Accordingly, the present invention should not be limited by the above-described embodiments, but should be defined by the appended claims and their equivalents.

Claims

1. A clustering device, comprising:

a spatial neighborhood selection unit configured to select a spatial neighborhood of each object in a spatial dataset, wherein the spatial dataset comprises data related to weather;

a core point calculation unit configured to calculate a core point in the spatial data set, the core point having similar attribute values to all other objects within a spatial neighborhood of the core point;

an extraction unit configured to extract, from the spatial data set, all the core points calculated by the core point calculation unit and all the objects located in the spatial neighborhoods of the core points to form corresponding spatial data subsets; and

a merging unit configured to cluster the spatial data subsets to obtain a clustering result of the weather-related data.

2. The apparatus of claim 1, wherein objects in the spatial dataset have a spatial location attribute and a thematic attribute.

3. The apparatus of claim 1, wherein the spatial neighborhood selection unit is configured to select the spatial neighborhood of each object in the spatial data set according to at least one of the following neighborhood construction techniques: eps-neighborhoods, KNN neighborhoods, and Delaunay triangulation first-order neighborhoods.

4. The device of claim 1, wherein the attribute value comprises at least one of: variance, standard deviation, variance to mean ratio, local spatial autocorrelation index.

5. The apparatus of claim 1, wherein the core point calculation unit is configured to:

calculate a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object;

perform multiple random rearrangements in the spatial data set; and

judge whether each object is a core point based on the significance of the local variance of the object.

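6. The apparatus of claim 1, wherein the core point calculation unit is configured to:

calculate a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object;

perform Bootstrap random sampling multiple times in the spatial data set; and

judge whether each object is a core point based on the significance of the local variance of the object.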

7. The apparatus of claim 1, wherein the core point calculation unit is configured to:

calculate a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object;

perform multiple random rearrangements in the spatial data set;

calculate the local variance of each object and perform chi-square curve fitting; and

calculate the significance of the local variance of each object according to the chi-square curve so as to judge whether the object is a core point.

8. The apparatus of claim 1, wherein the core point calculation unit is configured to:

calculate a spatial distance of each object in the spatial data set from the attribute values of other objects within a spatial neighborhood of the object;

perform kernel density estimation; and

search for local extreme points of the kernel density surface as core points.

9. The apparatus of claim 1, wherein the core point calculation unit is configured to:

calculate a local spatial autocorrelation index for each object in the spatial data set;

perform a Z-test on the local spatial autocorrelation index; and

take an object with a significant local spatial autocorrelation index as a core point.

10. The device of claim 1, wherein the merging unit is configured to:

selecting two adjacent most similar objects or clusters, and combining the two objects or clusters into a new space cluster;

judging whether the attribute values of the objects in the new space cluster are similar or not; and

if so, merging the two objects or clusters and continuously selecting two adjacent objects or clusters which are most similar for merging, otherwise, not merging the two selected objects or clusters.

11. The apparatus according to claim 10, wherein the merging unit is configured to determine whether the attribute values of the objects within the new spatial cluster are similar by:

performing multiple random rearrangements in the new spatial cluster, wherein after each rearrangement, a core point in the new spatial cluster is calculated according to the significance judgment of the local variance; and

if the core point is stable, the attribute values of the objects in the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

12. The apparatus according to claim 10, wherein the merging unit is configured to determine whether the attribute values of the objects within the new spatial cluster are similar by:

judging whether the attribute values of the objects in the new spatial cluster follow a normal distribution;

if they follow a normal distribution, judging whether the attribute means of the two objects or clusters differ significantly; and

if there is no significant difference, the attribute values of the objects within the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

13. A clustering method, comprising:

selecting a spatial neighborhood for each object in a spatial dataset, wherein the spatial dataset includes data related to weather;

calculating a core point in the spatial data set, the core point having similar attribute values to all other objects in a spatial neighborhood of the core point;

extracting all the calculated core points and all the objects located in the spatial neighborhood of the core points from the spatial data set to form corresponding spatial data subsets; and

clustering the spatial data subsets to obtain a clustering result of the data related to the weather.

14. The method of claim 13, wherein objects in the spatial dataset have a spatial location attribute and a thematic attribute.

15. The method of claim 13, wherein the spatial neighborhood of each object in the spatial data set is selected according to at least one of the following neighborhood construction techniques: eps-neighborhoods, KNN neighborhoods, and Delaunay triangulation first-order neighborhoods.

16. The method of claim 13, wherein the attribute values comprise at least one of: variance, standard deviation, variance to mean ratio, local spatial autocorrelation index.

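17. The method of claim 13, wherein calculating the core points in the spatial dataset comprises:

calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object;

performing multiple random rearrangements in the spatial data set; and

judging whether each object is a core point based on the significance of the local variance of the object.

18. The method of claim 13, wherein calculating the core points in the spatial dataset comprises:

calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object;

performing Bootstrap random sampling multiple times in the spatial data set; and

judging whether each object is a core point based on the significance of the local variance of the object.

19. The method of claim 13, wherein calculating the core points in the spatial dataset comprises:

calculating a local variance of each object in the spatial data set with other objects in the spatial neighborhood of the object;

performing multiple random rearrangements in the spatial data set;

calculating the local variance of each object and performing chi-square curve fitting; and

calculating the significance of the local variance of each object according to the chi-square curve so as to judge whether the object is a core point.

20. The method of claim 13, wherein calculating the core points in the spatial dataset comprises:

calculating a spatial distance of each object in the spatial data set from the attribute values of other objects within a spatial neighborhood of the object;

performing kernel density estimation; and

searching for local extreme points of the kernel density surface as core points.

21. The method of claim 13, wherein calculating the core points in the spatial dataset comprises:

calculating a local spatial autocorrelation index for each object in the spatial data set;

performing a Z-test on the local spatial autocorrelation index; and

taking an object with a significant local spatial autocorrelation index as a core point.

22. The method of claim 13, wherein clustering spatial data subsets comprises:

selecting two adjacent most similar objects or clusters, and combining the two objects or clusters into a new spatial cluster;

judging whether the attribute values of the objects in the new spatial cluster are similar; and

if so, merging the two objects or clusters and continuing to select the two adjacent objects or clusters that are most similar for merging; otherwise, not merging the two selected objects or clusters.

23. The method of claim 22, wherein determining whether the attribute values of the objects within the new spatial cluster are similar is performed by:

performing multiple random rearrangements in the new spatial cluster, wherein after each rearrangement, a core point in the new spatial cluster is calculated according to the significance judgment of the local variance; and

if the core point is stable, the attribute values of the objects in the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.

24. The method of claim 22, wherein determining whether the attribute values of the objects within the new spatial cluster are similar is performed by:

judging whether the attribute values of the objects in the new spatial cluster follow a normal distribution;

if they follow a normal distribution, judging whether the attribute means of the two objects or clusters differ significantly; and

if there is no significant difference, the attribute values of the objects within the new spatial cluster are similar; otherwise, the attribute values of the objects within the new spatial cluster are not similar.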