CN110781943A

CN110781943A - Clustering method based on adjacent grid search

Info

Publication number: CN110781943A
Application number: CN201910997760.7A
Authority: CN
Inventors: 李志猛; 王国锋; 赵坚; 黄钦
Original assignee: Tianjin Chengjian University
Current assignee: Tianjin Chengjian University
Priority date: 2019-10-21
Filing date: 2019-10-21
Publication date: 2020-02-11

Abstract

The invention discloses a clustering method based on an adjacent grid search strategy, comprising the following steps. First, the original data is gridded: the original data set is divided into a finite number of cells by a multi-dimensional space grid, and denoising is performed when necessary. Next, grid clustering is carried out on the divided data: the denoised grid is processed with a halo threshold that separates halo cells from core cells; an adjacent grid operator is established for quickly finding the adjacent cells of a cell; and clustering proceeds in two steps, core cell clustering and halo cell assignment, in which all core cells are grouped into clusters by a traversal algorithm and halo cells are assigned to existing clusters based on cell distance. Finally, clustering optimization is carried out according to the data characteristics and the user requirements. Compared with the prior art, the method remains applicable as the sample set dimension increases rapidly and can effectively identify clusters with complex boundary shapes.

Description

Clustering method based on adjacent grid search

Technical Field

The invention relates to the technical field of unsupervised pattern recognition and data mining, in particular to a clustering method based on grids.

Background

With the development of big data and network technology, a great amount of data has accumulated in various disciplines and fields, so cluster analysis has become an increasingly important technology. The process of dividing a collection of physical or abstract objects into classes composed of similar objects is called clustering. A cluster produced by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters. As clustering is applied in more fields, higher robustness is demanded of clustering algorithms. The following kinds of data sets are of increasing interest in many applications: (1) noisy data sets; (2) large-scale data sets; (3) high-dimensional data sets; (4) data sets with arbitrarily shaped clusters; (5) data sets with large density differences among classes; (6) data sets with highly overlapping classes.

At present, the mainstream clustering methods include partition-based, density-based, hierarchy-based and grid-based methods, among others. Partition-based methods can only find hyperspherical clusters; density-based methods have difficulty with data sets that are noisy or high-dimensional; and hierarchy-based methods handle noisy data and data that overlap between classes poorly. In addition, all of these methods operate directly on the data points, so their running time increases markedly on large-scale data sets, which makes them hard to apply in engineering practice. By contrast, a grid-based clustering method divides the data set into a number of grid cells. Because the number of grid cells is far smaller than the number of sample points, the running time is greatly reduced and is not affected by the number of sample points.

Grid clustering algorithms differ in how they process the divided grid. STING adopts a top-down query method: a query condition is set first; starting from a certain layer, the grids of that layer that satisfy the query condition are returned, while the grids that do not are deleted and no longer considered; the returned grids are then extended to the next layer, where the query is repeated, and so on down to the bottom layer. The process therefore keeps deleting grids that fail the query condition. When a cluster with a complex boundary is encountered, the low resolution of the upper layers means that the boundary shape cannot be identified, so grids are deleted by mistake and the clustering result is distorted. WaveCluster treats the distribution density of the data along a given direction as a set of one-dimensional signals and uses wavelet decomposition to separate the regions where the density changes frequently, thereby detecting the cluster boundaries and achieving clustering. The number of one-dimensional signals in this algorithm grows exponentially with the sample set dimension, so the algorithm becomes infeasible when the dimension is large. CLIQUE starts from one dense unit and greedily grows along each dimension to find all connected dense units, which finally form a cluster. This method can only find spherical clusters; a non-spherical cluster tends to be broken down into several connected spherical clusters. OptiGrid focuses on how to construct the optimal partition of the multi-dimensional sample set; for the partitioned grid space, the algorithm simply regards each isolated dense grid as a cluster. The clustering quality of this method depends heavily on the projection algorithm and the gridding it produces. From the above analysis, current grid clustering algorithms have the following problems in grid processing: (1) clusters with complex boundary shapes cannot be identified effectively; (2) the algorithms cannot cope with the complexity caused by the rapid increase of the sample set dimension; (3) when the boundaries of several clusters are connected, the algorithms cannot separate them effectively and tend to merge them into one cluster.

Therefore, a new clustering algorithm that effectively solves the above problems is needed.

Disclosure of Invention

In view of the defects of conventional clustering methods and the clustering requirements of the specific data sets listed above, the invention provides a clustering method based on adjacent grid search. The method divides the original data set into a finite number of grid units (cells for short) with an adaptive grid division method, establishes an adjacent grid operator and uses it to perform cluster analysis on the cells, and can optionally apply clustering optimization under specific conditions to improve the clustering quality.

The invention aims to provide a clustering method based on an adjacent grid search strategy, which comprises the following steps:

step 1, performing grid division on the original data: dividing the original data set into a finite number of cells by using a multi-dimensional space grid, on the assumption that data points in the same cell belong to the same cluster; enabling statistical analysis of the cells by setting three attributes for each cell, namely member, density and position; and checking the grid against a noise threshold to judge whether the data set contains noise data and, if so, performing denoising;

step 2, performing grid clustering on the divided data: processing the denoised grid with a halo threshold, which divides the cells into halo cells and core cells; establishing an adjacent grid operator for quickly finding the adjacent cells of a cell; and realizing the clustering process in two steps, core cell clustering and halo cell assignment, in which all core cells are grouped into a number of clusters by a traversal algorithm and halo cells are assigned to existing clusters based on cell distance;

step 3, performing clustering optimization according to the data characteristics and user requirements: checking the clusters against a merging threshold to judge whether clustering optimization is needed; if a cluster contains fewer cells than the merging threshold, it is regarded as a minimal class and is merged based on the inter-class distance; and assigning labels to the final clusters, then to the cells in each cluster and to the data points in each cell, which completes the clustering process.

The original data set is partitioned with the multi-dimensional space grid by dividing any dimension of the multi-dimensional space $S^d$ with an adaptive scale, for which an adaptive scale sequence is constructed

The following two cases are distinguished:

for a uniform grid space, an infinite monotonic scale sequence is represented as follows:

for the density grid space, the following process is performed:

first, the resolution $R$ is calculated:

wherein int(x) represents a forward rounding function, $N$ is the sample set capacity, $d$ is the sample set dimension, and $f_R$ is the resolution factor.

Then, when the object being divided is the smallest bounded space $S^D$ of a multi-dimensional sample set $D^d$, the scale sequence

has a finite length; let the scale sequence expression be:

wherein $d$ is the total number of dimensions of the multi-dimensional space, $i$ is any dimension of the multi-dimensional space, and $R$ is the resolution.

The adjacent grid operator $\mathrm{Aopt}^d$ is represented as follows:

wherein

is a $d$-dimensional coordinate vector arranged in ascending symmetric-ternary order; $T = 3^d$ is the number of all grids in the adjacent grid operator; the symbol "-" represents the set difference; and $\mathbf{0}$ represents the $d$-dimensional zero vector.

The noise threshold threN is calculated by:

wherein $f_N$ is the noise coefficient and $M_1$ is the total number of all non-empty cells. In general, $f_N$ is set between 0 and 1; the more noise the sample set contains, the larger the value of $f_N$.

The cell distance refers to the similarity distances between any cell

and all of its adjacent cells, represented as a vector:

wherein $d$ is the total number of dimensions of the multi-dimensional space.

Compared with traditional clustering methods, the method has the following technical advantages:

the original data set is divided into a finite number of grid units, and an adjacent grid operator is established and used to perform cluster analysis on the cells, so the method remains applicable as the sample set dimension increases rapidly, and clusters with complex boundary shapes can be identified effectively.

Drawings

Fig. 1 is a flow chart of the clustering method based on an adjacent grid search strategy according to the present invention;

Fig. 2 is a schematic diagram of an example of the grid partition of a two-dimensional sample set, where (a) is a two-dimensional Gaussian sample set, (b) is the cell numbering of a uniform grid space, (c) is the partition in the uniform grid space, and (d) is the partition in a density grid space.

Detailed Description

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.

Fig. 1 is a flow chart of the clustering method based on an adjacent grid search strategy according to the present invention. The method specifically comprises the following steps:

First, the original data is gridded: the original data set is divided into a finite number of grid units (hereinafter referred to as cells) by a multi-dimensional space grid, on the assumption that data points in the same cell belong to the same cluster; statistical analysis of the cells is enabled by setting three attributes for each cell, namely member, density and position; and the grid is checked against a noise threshold to judge whether the data set contains noise data, in which case the necessary denoising is performed.

The multi-dimensional space in this step is a $d$-dimensional space $S^d$ formed by $d$ orthogonal continuous dimensions, expressed as follows:

where every dimension of the multi-dimensional space is continuous, infinite and uniform, and all dimensions are mutually independent. If any $d_0$ ($d_0 < d$) dimensions selected from $S^d$ form a multi-dimensional space, that space is called a subspace of $S^d$:

where $d_0$ ($d_0 < d$) represents any number of dimensions.

A subset $S^{d'}$ of $S^d$ is a bounded or semi-infinite space obtained by truncating one or more dimensions of $S^d$.

A multi-dimensional sample set $D^d$ distributed in the multi-dimensional space is expressed as follows:

where $N$ is the sample set capacity (i.e., the total number of samples in the sample set), and

is the $i$-th sample, expressed as:

and correspondingly:

To partition $D^d$, a bounded subset $S^{d'}$ of $S^d$ over which $D^d$ is distributed is determined first; $S^{d'}$ is then divided to establish a multi-dimensional grid space, which realizes the partition of $D^d$.

Corresponding to the multi-dimensional sample set $D^d$, there is a minimum bounded subset $S^D$:

where the range of the $i$-th dimension of $S^D$ is given by:

The multi-dimensional space $S^d$ is divided by an infinite monotonic scale sequence, expressed as follows:

The divided dimension

is expressed as a series of left-closed, right-open intervals:

where $u_{ij}$ represents a left-closed, right-open interval, the subscript $i$ denotes the $i$-th dimension and the subscript $j$ denotes the $j$-th interval, expressed as follows:

When the object being divided is the smallest bounded space $S^D$ of the multi-dimensional sample set $D^d$, the scale sequence has a finite length. Let the scale sequence expression be:

where $R$ is the resolution. Accordingly, the generated intervals are represented as follows:

As can be seen,

At this point the division of the minimum bounded space of the multi-dimensional sample set has been established; it is used below to build the multi-dimensional grid space and to partition the multi-dimensional sample set.

When the minimum bounded subset $S^D$ is divided, every dimension uses the same resolution. After division, $S^D$ therefore yields a grid space with $M = R^d$ cells.

The multi-dimensional grid space generated by dividing the minimum bounded subset $S^D$ is expressed as:

where

is the $i$-th cell of the grid space $G^D$; each cell of $G^D$ is a hyper-rectangle.

At the same time, the cell

also has the following attribute values:

where location is the coordinate of the cell in the grid space; number holds the members of the cell, i.e., the set of sample points within the range of the cell; and density is the density value of the cell, i.e., the number of sample points within the range of the cell.

For the truly infinite multi-dimensional space $S^d$, a corresponding multi-dimensional grid space $G^d$ can likewise be established.

For the $i$-th cell in $G^D$

the coordinates are expressed as follows:

where $c_{ij}$ represents the coordinate of the cell in the $j$-th dimension. The cell number $i$ and the cell coordinate location have the following correspondence:

The following points should be noted when a multi-dimensional grid space is generated: (1) when a multi-dimensional sample set is divided, the larger the resolution $R$, the more cells are obtained and the smaller their average density, so a suitable resolution helps to improve the clustering quality; (2) since the sample points are usually not uniformly distributed in space, a large number of useless empty cells, i.e., cells that contain no sample points, are generated; such empty cells waste a large amount of storage space and can even make the computation infeasible, especially for high-dimensional sample sets or when the resolution is too large; (3) the size of each cell is determined by the intervals that the scale sequence generates at that cell, and a suitable cell size helps to improve the clustering quality.

The invention uses a grid division method based on an adaptive scale, which mainly comprises the following steps:

(1) Calculating the resolution $R$:

The average density of the cells is estimated by equation (11):

The above formula shows that the average density of the cells is not affected by the sample set capacity, so this way of setting the resolution is adaptive.

(2) Constructing a scale sequence

The method of constructing the scale sequence is not unique, and two typical methods are given here. One is uniform division, in which the scale sequence generates equally spaced intervals; the multi-dimensional grid space generated in this way is called a uniform grid space. The other is density division, in which the intervals generated by the scale sequence depend on the distribution density of the samples: where the samples are denser, the intervals are smaller and more cells are generated. The multi-dimensional grid space generated in this way is called a density grid space.

(3) Generating a grid space based on the multi-dimensional sample set:

To avoid the waste of storage space caused by a large number of empty cells, an algorithm is designed that generates the grid space directly from the multi-dimensional sample set. The grid space generated by this algorithm contains no empty cells, which greatly improves computational efficiency and saves storage space, while the attribute values of the cells remain consistent with those of the original grid space. The pseudo-code for this algorithm is as follows:
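The following is not the patent's pseudo-code but a minimal Python sketch of the same idea for the uniform division case: each sample is mapped to its cell coordinate and only cells that receive at least one sample are stored. The function name `build_grid`, the dictionary layout and the 0-based cell coordinates are assumptions made for illustration.

```python
from collections import defaultdict


def build_grid(points, resolution):
    """Map each sample to a cell of a uniform grid; only non-empty cells are stored.

    points: list of equal-length tuples of floats (the sample set).
    resolution: number of intervals per dimension (R in the text).
    Returns {location: {"member": [sample indices], "density": count}}.
    """
    dims = len(points[0])
    lows = [min(p[k] for p in points) for k in range(dims)]
    highs = [max(p[k] for p in points) for k in range(dims)]

    members = defaultdict(list)
    for index, point in enumerate(points):
        location = []
        for k in range(dims):
            span = highs[k] - lows[k]
            if span == 0:
                coord = 0  # degenerate dimension: a single interval
            else:
                coord = int((point[k] - lows[k]) / span * resolution)
                # Intervals are left-closed, right-open, so the maximum
                # value is placed in the last interval.
                coord = min(coord, resolution - 1)
            location.append(coord)
        members[tuple(location)].append(index)

    return {
        loc: {"member": idx, "density": len(idx)}
        for loc, idx in members.items()
    }
```

Because the grid is a dictionary keyed by cell coordinate, its size is bounded by the number of samples rather than by $R^d$.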

Taking the gridding of a two-dimensional sample set as an example: before clustering, the sample set is first divided by the uniform division method and the sample points contained in cells whose density is below the threshold are removed; the denoised sample set is then divided by either the uniform division method or the density division method and clustered. First, a two-dimensional sample set $D^2$ that follows a Gaussian distribution in each dimension is generated, as shown in Fig. 2(a). When the resolution factor $f_R$ is 0.8, equation (11) gives a resolution $R$ of 4, so a 4 × 4 two-dimensional grid space is generated; Fig. 2(b) shows the correspondence between the cell numbers and the cell coordinates. The result of dividing $D^2$ by the uniform division method is shown in Fig. 2(c), and the result of dividing $D^2$ by the density division method is shown in Fig. 2(d). Table 1 shows the gridding results for the two-dimensional sample set.

Denoising data: by setting a reasonable density threshold, the noise can be effectively filtered. In the present algorithm, the noise threshold threN of a multi-dimensional sample set $D^d$ is calculated by:

where $f_N$ is the noise coefficient and $M_1$ is the total number of all non-empty cells. In general, $f_N$ is set between 0 and 1; the more noise the sample set contains, the larger the value of $f_N$.

Then, grid clustering is performed on the divided data: the denoised grid is processed with a halo threshold, which divides the cells into halo cells and core cells; an adjacent grid operator is established for quickly finding the adjacent cells of a cell; and the clustering process is realized in two steps, core cell clustering and halo cell assignment, in which all core cells are grouped into a number of clusters by a traversal algorithm and halo cells are assigned to existing clusters based on cell distance. The specific operation is as follows:

The denoised grid is processed as follows:

(1) all cells are sorted by density in descending order;

(2) a halo threshold threH is set; cells with density less than the threshold are defined as halo cells, and cells with density greater than or equal to the threshold are defined as core cells. The halo threshold is calculated by the following formula:

where $f_H$ is the halo coefficient and $M_2$ is the total number of cells after denoising.

After this division, the regions where the boundaries of different clusters are connected can be identified through the halo cells, which isolates the core cells of each cluster. The larger the halo coefficient, the more cells are classified as halo cells; the coefficient should therefore not be set too large and should be chosen according to how strongly the cluster boundaries in the sample set are connected. If the clusters in the sample set are boundary-connected, $f_H$ is typically between 0.5 and 3.5; if the clusters in the sample set are not boundary-connected, $f_H$ is set to 0.
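The sorting and splitting step can be sketched as below. The formula for threH is not reproduced here, so the sketch takes the threshold as a given number; the function name `split_core_and_halo` and the dictionary input are assumptions for illustration.

```python
def split_core_and_halo(densities, halo_threshold):
    """Sort cells by density (descending) and split them at the halo threshold.

    densities: {cell location: density} for the denoised grid.
    halo_threshold: threH, supplied by the caller.
    Returns (core, halo): two lists of locations in descending density order.
    """
    ordered = sorted(densities, key=lambda loc: densities[loc], reverse=True)
    # Density >= threshold gives a core cell, density < threshold a halo cell.
    core = [loc for loc in ordered if densities[loc] >= halo_threshold]
    halo = [loc for loc in ordered if densities[loc] < halo_threshold]
    return core, halo
```

With a threshold of 0, which corresponds to $f_H = 0$ for clusters that are not boundary-connected, every cell is a core cell and the halo list is empty.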

The clustering process consists of traversing the core cells and processing the halo cells. In the first stage, the cells sorted by density are traversed in sequence to find the cores of all the clusters. The traversal follows these principles:

(1) each time a new cell is taken, it is checked whether the cell already belongs to an existing cluster; if not, the cell is defined as a new cluster; otherwise, the next cell is processed;

(2) for a newly defined cluster, the core cells adjacent to the cell that created the cluster are found first and added to the cluster; the cells in the cluster are then processed in turn, and the core cells adjacent to each of them are added to the cluster, until no new cell is added.
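The two traversal principles above amount to a breadth-first search over adjacent core cells. A minimal sketch, assuming integer cell coordinates, integer cluster labels starting at 0, and neighbor offsets drawn from {-1, 0, 1} in every dimension (the function name is hypothetical):

```python
from collections import deque
from itertools import product


def cluster_core_cells(core_cells):
    """Group core cells into clusters of mutually reachable adjacent cells.

    core_cells: list of integer coordinate tuples, sorted by descending density.
    Returns {cell location: cluster label}, labels starting at 0.
    """
    if not core_cells:
        return {}
    dims = len(core_cells[0])
    # All offsets in {-1, 0, 1}^dims except the zero vector.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    core_set = set(core_cells)
    labels = {}
    next_label = 0

    for seed in core_cells:
        if seed in labels:
            continue  # principle (1): already in an existing cluster
        labels[seed] = next_label  # principle (1): start a new cluster
        queue = deque([seed])
        while queue:  # principle (2): grow until no new cell is added
            cell = queue.popleft()
            for offset in offsets:
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in core_set and neighbor not in labels:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return labels
```

Looking neighbors up in a set keeps each adjacency test constant-time, so the cost is governed by the number of core cells times the number of offsets.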

In the second stage, each halo cell is assigned, on the nearest-neighbor principle, to the cluster in which its nearest adjacent cell is located. To measure the distance between cells, the center of gravity of all data points in a cell is defined, and the distance between the centers of gravity of two cells is used as the distance between the cells. For any cell

the center of gravity is calculated as follows:

where $n$ is the number of samples in the cell.
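Read literally, the center of gravity of a cell is the arithmetic mean of the sample points it contains. A minimal sketch under that reading (the function name is hypothetical):

```python
def center_of_gravity(samples):
    """Return the per-dimension mean of the samples in one cell.

    samples: non-empty list of equal-length tuples of floats.
    """
    n = len(samples)  # number of samples in the cell
    dims = len(samples[0])
    return tuple(sum(s[k] for s in samples) / n for k in range(dims))
```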

For a halo cell there are two possible cases: it either has adjacent cells or has none. When it has adjacent cells, the distance between it and each adjacent cell is calculated. The similarity distance between any two cells

can be calculated by the following formula:

and the adjacent cell nearest to it is then found. Let a certain cell

have the given coordinates. Then, in $G^D$, the

non-adjacent cell with the minimum similarity distance to it

satisfies the following expression:

where $j$ is an integer no greater than $d$; thus

and

the similarity distance between them is 2.

For a certain cell

the coordinates of its adjacent cells are expressed as follows:

where each resulting cell is an

adjacent cell of the given cell, also called an adjacent grid, and $\mathrm{Aopt}^d$ is the $d$-dimensional adjacency operator:

wherein

is a $d$-dimensional coordinate vector arranged in ascending symmetric-ternary order; $T = 3^d$ is the number of all grids in the adjacent grid operator, and $T - 1$ is the number of active grids in the adjacent grid operator; the symbol "-" represents the set difference; and $\mathbf{0}$ represents the $d$-dimensional zero vector.

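Taking a two-dimensional grid space as an example, the adjacency operator is:

$\mathrm{Aopt}^2 = \{\langle -1,-1\rangle, \langle -1,0\rangle, \langle -1,1\rangle, \langle 0,-1\rangle, \langle 0,1\rangle, \langle 1,-1\rangle, \langle 1,0\rangle, \langle 1,1\rangle\}$ (24)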
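The operator can be generated for any dimension by enumerating all vectors over {-1, 0, 1} and removing the zero vector. The sketch below uses `itertools.product`, whose lexicographic output order coincides with ascending symmetric-ternary order; the function names are hypothetical.

```python
from itertools import product


def adjacent_grid_operator(dims):
    """Return the 3**dims - 1 neighbor offsets of a cell, in ascending order."""
    # product() yields all 3**dims vectors; any(o) is False only for the zero vector.
    return [o for o in product((-1, 0, 1), repeat=dims) if any(o)]


def adjacent_cells(location):
    """Return the coordinates of all cells adjacent to `location`."""
    offsets = adjacent_grid_operator(len(location))
    return [tuple(c + o for c, o in zip(location, offset)) for offset in offsets]
```

For `dims = 2` the function returns the eight offsets of the two-dimensional operator, and for `dims = 3` it returns 26. The count grows as $3^d - 1$, so full enumeration becomes costly for large $d$.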

while

The cluster to which the nearest adjacent cell belongs is then determined, and the halo cell is assigned to that cluster. If a halo cell has no adjacent cell, the halo cell is defined as a new cluster. Once all the halo cells have been processed, the whole clustering process is complete.
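The halo stage can be sketched as follows. Several points are assumptions rather than statements of the patent: the distance between cells is taken as the Euclidean distance between their centers of gravity, only adjacent cells that already carry a cluster label are considered, labels are integers, and all names are hypothetical.

```python
import math
from itertools import product


def assign_halo_cells(halo_cells, centers, labels):
    """Give each halo cell the label of its nearest labeled adjacent cell.

    halo_cells: list of integer coordinate tuples, in processing order.
    centers: {cell location: center of gravity of its samples}.
    labels: {cell location: cluster label} for the core cells; updated in place.
    """
    next_label = max(labels.values(), default=-1) + 1
    for cell in halo_cells:
        neighbors = (
            tuple(c + o for c, o in zip(cell, offset))
            for offset in product((-1, 0, 1), repeat=len(cell))
            if any(offset)
        )
        labeled = [nb for nb in neighbors if nb in labels]
        if labeled:
            # Assumed metric: Euclidean distance between centers of gravity.
            nearest = min(
                labeled, key=lambda nb: math.dist(centers[cell], centers[nb])
            )
            labels[cell] = labels[nearest]
        else:
            labels[cell] = next_label  # no adjacent cell: new cluster
            next_label += 1
    return labels
```

Because labels are updated in place, a halo cell processed earlier can serve as the nearest labeled neighbor of a later one; whether that matches the intended behavior depends on details not given in the text.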

When a sample set has a high dimension, the similarity distance between a cell and most of its adjacent cells does not meet the clustering requirement. Therefore, when a high-dimensional sample set is clustered, the similarity distance vector Adistance of the adjacent cells should be truncated, i.e., only the adjacent cells corresponding to the first 4 similarity distances in Adistance are considered.

Finally, clustering optimization is performed according to the data characteristics and user requirements: the clusters are checked against a merging threshold to judge whether clustering optimization is needed; if a cluster contains fewer cells than the merging threshold, it is regarded as a minimal class and is merged based on the inter-class distance; labels are then assigned to the final clusters, and from them to the cells in each cluster and to the data points in each cell, which completes the clustering process. The clustering method used by the invention can be a k-means clustering method, a DBSCAN clustering method, a CFSFDP clustering method, a WaveCluster clustering method, a CAGS clustering method and the like.
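The final label assignment, from clusters to cells and from cells to data points, can be sketched as below. The marker `-1` for points whose cell has no label (for example, cells removed during denoising) is an assumed convention, and the function and parameter names are hypothetical.

```python
def label_points(cell_members, cell_labels, n_points, noise_label=-1):
    """Propagate cluster labels from cells to the data points they contain.

    cell_members: {cell location: list of sample indices in that cell}.
    cell_labels: {cell location: cluster label}.
    n_points: total number of samples in the data set.
    """
    point_labels = [noise_label] * n_points
    for location, indices in cell_members.items():
        label = cell_labels.get(location, noise_label)
        for index in indices:
            point_labels[index] = label
    return point_labels
```

Since every point in a cell receives the same label, this step is linear in the number of samples.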

Claims

1. A clustering method based on an adjacent grid search strategy is characterized by comprising the following steps:

2. The method of claim 1, wherein the original data set is partitioned with the multi-dimensional space grid by dividing any dimension of the multi-dimensional space $S^d$ with an adaptive scale, for which an adaptive scale sequence is constructed

The following two cases are distinguished:

for the density grid space, the following process is performed:

first, the resolution $R$ is calculated:

wherein int(x) represents a forward rounding function, $N$ is the sample set capacity, $d$ is the sample set dimension, and $f_R$ is the resolution factor;

has a finite length; let the scale sequence expression be:

3. The adjacent grid search strategy-based clustering method of claim 1, wherein the adjacent grid operator $\mathrm{Aopt}^d$ is represented as follows:

wherein

4. The adjacent grid search strategy-based clustering method according to claim 1, wherein the noise threshold threN is calculated by the following formula:

wherein $f_N$ is the noise coefficient and $M_1$ is the total number of all non-empty cells; the more noise the sample set contains, the larger the value of $f_N$.

5. The method of claim 1, wherein the cell distance refers to any cell

wherein $d$ is the total number of dimensions of the multi-dimensional space.