CN110781943A - Clustering method based on adjacent grid search - Google Patents

Publication number
CN110781943A
Authority
CN
China
Prior art keywords
cell
grid
clustering
cells
halo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910997760.7A
Other languages
Chinese (zh)
Inventor
李志猛
王国锋
赵坚
黄钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Chengjian University
Original Assignee
Tianjin Chengjian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Chengjian University filed Critical Tianjin Chengjian University
Priority to CN201910997760.7A priority Critical patent/CN110781943A/en
Publication of CN110781943A publication Critical patent/CN110781943A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Abstract

The invention discloses a clustering method based on an adjacent grid search strategy, which comprises the following steps. First, the original data is divided into a grid: the original data set is divided into a finite number of cells using a multidimensional space grid, and denoising is performed when necessary. Next, grid clustering is performed on the divided data: the denoised grid is processed with a halo threshold to divide its cells into halo cells and core cells; an adjacent grid operator is established for quickly finding the adjacent cells of a cell; and clustering is carried out in two steps, core cell clustering and halo cell division, in which all core cells are divided into several clusters by a traversal algorithm and the halo cells are assigned to existing clusters based on cell distance. Finally, clustering optimization is performed according to the data characteristics and user requirements. Compared with the prior art, the invention provides a new clustering method that copes with rapidly increasing sample set dimension and can effectively identify clusters with complex boundary shapes.

Description

Clustering method based on adjacent grid search
Technical Field
The invention relates to the technical field of unsupervised pattern recognition and data mining, and in particular to a grid-based clustering method.
Background
With the development of big data and network technology, data have accumulated in vast quantities across many disciplines and fields, making cluster analysis an increasingly important technique. Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to objects in other clusters. As clustering is applied in more fields, greater robustness is demanded of clustering algorithms. The following kinds of data set are of increasing interest in many applications: (1) noisy data sets; (2) large-scale data sets; (3) high-dimensional data sets; (4) data sets with arbitrarily shaped clusters; (5) data sets with large density differences between classes; (6) data sets with heavy overlap between classes.
At present, the mainstream clustering methods mainly include partition-based, density-based, hierarchy-based and grid-based methods. Partition-based methods can only find hyperspherical clusters; density-based methods have difficulty with data sets that have high noise levels or high dimensionality; hierarchy-based methods handle noisy data and overlapping classes poorly. Moreover, all of these methods operate directly on data points, so their running time increases markedly on large-scale data sets, which makes them hard to apply in engineering practice. By contrast, grid-based clustering methods divide the data set into grid cells. Because the number of grid cells is far smaller than the number of sample points, the running time is greatly reduced and is not affected by the number of sample points. Grid clustering algorithms differ in how they process the divided grid. STING uses a top-down query method: a query condition is set first; starting from a certain layer, the grids in that layer that satisfy the query condition are returned, while grids that do not satisfy it are deleted and no longer considered; the returned grids are then extended to the next layer, where the query is executed again, and this is repeated down to the bottom layer.
This process continually deletes grids that do not satisfy the query condition. When a cluster with a complex boundary is encountered, the low resolution of the upper-layer grids prevents the boundary shape from being identified, so grids are deleted by mistake and the clustering result is distorted. WaveCluster treats the distribution density of the data along a given direction as a set of one-dimensional signals and uses wavelet decomposition to separate the regions where the density changes frequently, thereby detecting cluster boundaries and achieving clustering. The number of one-dimensional signals in this algorithm grows exponentially with the dimension of the sample set, so the algorithm becomes infeasible when the dimension is large. CLIQUE starts from one dense unit and finds all connected dense units by greedy growth along each dimension, finally forming a cluster. This method can only find spherical clusters; it tends to break a non-spherical cluster into several connected spherical clusters. OptiGrid focuses on constructing the optimal partition of a multidimensional sample set; as for the partitioned grid space, the algorithm simply treats certain independent dense grids as clusters. Its clustering quality depends heavily on the projection algorithm and the grid division that it produces. The above analysis shows that current grid clustering algorithms have the following problems in grid processing: (1) clusters with complex boundary shapes cannot be identified effectively; (2) algorithmic complexity becomes unmanageable as the dimension of the sample set grows rapidly; (3) when the boundaries of several clusters are connected, the algorithm cannot separate them and tends to classify them as one cluster.
Therefore, a new clustering algorithm that effectively solves the above problems needs to be proposed.
Disclosure of Invention
In view of the shortcomings of existing clustering methods and the clustering requirements of various specific data sets, the invention provides a clustering method based on adjacent grid search. The method divides the original data set into a finite number of grid units (cells for short) using an adaptive grid division method, establishes an adjacent grid operator and uses it to perform cluster analysis on the cells, and can optionally apply clustering optimization under specific conditions to improve clustering quality.
The invention provides a clustering method based on an adjacent grid search strategy, which comprises the following steps:
step 1, performing grid division on the original data: dividing the original data set into a finite number of cells using a multidimensional space grid, on the assumption that data points in the same cell belong to the same cluster; supporting statistical analysis of the cells by setting three attributes for each cell, namely member, density and position; and checking the grid against a noise threshold to determine whether the data set contains noise data, performing denoising if it does;
step 2, performing grid clustering on the divided data: processing the denoised grid with a halo threshold to divide its cells into halo cells and core cells; establishing an adjacent grid operator for quickly finding the adjacent cells of a cell; and carrying out clustering in two steps, core cell clustering and halo cell division, in which all core cells are divided into several clusters by a traversal algorithm and the halo cells are assigned to existing clusters based on cell distance;
step 3, performing clustering optimization according to the data characteristics and user requirements: checking the clusters against a merging threshold to determine whether clustering optimization is needed; if a cluster has fewer cells than the merging threshold, it is regarded as a minimal class and is merged based on the inter-class distance; labels are then assigned to the final clusters, and through them to the cells in each cluster and the data points in each cell, which completes the clustering process.
In the process of dividing the original data set with a multidimensional space grid, any dimension of the multidimensional space S^d is divided using an adaptive scale, and an adaptive scale sequence
Figure BDA0002240273900000031
is constructed. The following two cases are distinguished:
for a uniform grid space, an infinite monotonic scale sequence is expressed as follows:
Figure BDA0002240273900000032
for a density grid space, the following procedure is performed:
first, the resolution R is calculated:
where int(x) denotes the forward rounding function, N is the sample set capacity, d is the sample set dimension, and f_R is the resolution factor.
Then, when the object being divided is the minimum bounded space S_D of the multidimensional sample set D^d, the scale sequence
Figure BDA0002240273900000034
has a finite length; let the scale sequence be:
Figure BDA0002240273900000035
where d is the total number of dimensions of the multidimensional space, i is any dimension of the multidimensional space, and R is the resolution.
The adjacent grid operator Aopt^d is expressed as follows:
Figure BDA0002240273900000041
where
Figure BDA0002240273900000042
is a d-dimensional coordinate vector, the vectors being arranged in ascending symmetric ternary order; T = 3^d is the number of all grids in the adjacent grid operator; the symbol "-" denotes set difference; and 0 denotes the d-dimensional zero vector.
The noise threshold threN is calculated by the following formula:
Figure BDA0002240273900000043
where f_N is the noise coefficient and M_1 is the total number of non-empty cells. In general, f_N is set between 0 and 1; the more noise the sample set contains, the larger the value of f_N.
The cell distance is the similarity distance between any cell
Figure BDA0002240273900000044
and all of its adjacent cells, represented as a vector:
where d is the total number of dimensions of the multidimensional space.
Compared with traditional clustering methods, the invention has the following technical advantages:
the method divides the original data set into a finite number of grid units, then establishes an adjacent grid operator and uses it to perform cluster analysis on the cells; it thereby provides a new clustering method that copes with rapidly increasing sample set dimension, and it can effectively identify clusters with complex boundary shapes.
Drawings
FIG. 1 is a schematic diagram of a flow chart of a clustering method based on an adjacent grid search strategy according to the present invention;
FIG. 2 is a schematic diagram of an example of grid division of a two-dimensional sample set, where (a) is a two-dimensional Gaussian sample set, (b) is the cell numbering of a uniform grid space, (c) is the division in the uniform grid space, and (d) is the division in a density grid space.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
FIG. 1 is a flowchart of the clustering method based on an adjacent grid search strategy according to the present invention. The method specifically comprises the following steps:
First, the original data is divided into a grid: the original data set is divided into a finite number of grid units (hereinafter referred to as cells) using a multidimensional space grid, on the assumption that data points in the same cell belong to the same cluster; statistical analysis of the cells is supported by setting three attributes for each cell, namely member, density and position; and the grid is checked against a noise threshold to determine whether the data set contains noise data, with denoising performed where necessary.
The multidimensional space in this step is a d-dimensional space S^d formed by d orthogonal continuous dimensions, expressed as follows:
Figure BDA0002240273900000051
where every dimension of the multidimensional space is continuous, infinite and uniform, and all dimensions are mutually independent. If any d_0 (d_0 < d) dimensions selected from S^d form a multidimensional space, that space is called a subspace of S^d:
Figure BDA0002240273900000052
where d_0 (d_0 < d) is any number of dimensions;
a subset S^d′ of S^d is a bounded or semi-infinite space obtained by truncating one or more dimensions of S^d.
A multidimensional sample set D^d distributed in the multidimensional space is expressed as follows:
Figure BDA0002240273900000053
where N is the sample set capacity (i.e., the total number of samples in the sample set), and
Figure BDA0002240273900000054
is the i-th sample, expressed as:
with the following correspondence:
Figure BDA0002240273900000056
First, a bounded subset S^d′ of S^d over which the multidimensional sample set D^d is distributed is determined; S^d′ is then divided to establish a multidimensional grid space, thereby dividing D^d.
For the multidimensional sample set D^d, there exists a minimum bounded subset S_D:
Figure BDA0002240273900000061
where the range of the i-th dimension of S_D is given by:
Figure BDA0002240273900000063
Any dimension of the multidimensional space S^d is divided by an infinite monotonic scale sequence, which is expressed as follows:
Figure BDA0002240273900000064
The divided dimension
Figure BDA0002240273900000065
is expressed as a series of left-closed, right-open intervals:
Figure BDA0002240273900000066
where u_ij denotes a left-closed, right-open interval, the subscript i denotes the i-th dimension and the subscript j denotes the j-th interval, expressed as follows:
Figure BDA0002240273900000067
When the object being divided is the minimum bounded space S_D of the multidimensional sample set D^d, the scale sequence has a finite length. Let the scale sequence be:
Figure BDA0002240273900000069
where R is the resolution. Accordingly, the generated intervals are expressed as follows:
Figure BDA00022402739000000610
It can be seen that
Figure BDA00022402739000000611
At this point, the division of the minimum bounded space of the multidimensional sample set has been established; it is used below to build the multidimensional grid space and to divide the multidimensional sample set.
When the minimum bounded subset S_D is divided, every dimension uses the same resolution, so the division produces a grid space with M = R^d cells.
The multidimensional grid space G_D generated by dividing the minimum bounded subset S_D is expressed as:
Figure BDA0002240273900000071
where
Figure BDA0002240273900000072
is the i-th cell of the grid space G_D; each cell is a hyper-rectangle in G_D.
In addition, the cell
Figure BDA0002240273900000073
has the following attribute values:
Figure BDA0002240273900000074
where location is the coordinate of the cell in the grid space, member is the set of sample points lying within the cell, and density is the density value of the cell, i.e. the number of sample points within the cell.
For the infinite multidimensional space S^d itself, a corresponding multidimensional grid space G^d can also be established.
For the i-th cell
Figure BDA0002240273900000075
in G_D, the coordinates are expressed as follows:
Figure BDA0002240273900000076
where c_ij denotes the coordinate of the cell in the j-th dimension. The cell number i and the cell coordinate location have the following correspondence:
Figure BDA0002240273900000078
When generating a multidimensional grid space, the following should be noted: (1) when a multidimensional sample set is divided, the larger the resolution R, the more cells are obtained and the smaller the average cell density, so a suitable resolution helps improve the clustering result; (2) because sample points are usually distributed unevenly in space, many useless empty cells (cells containing no sample points) are generated; such empty cells waste a large amount of storage space and can even make the computation infeasible, especially for high-dimensional sample sets or excessively large resolutions; (3) the size of each cell is determined by the intervals that the scale sequence generates at that cell, and a suitable cell size helps improve the clustering result.
The grid division method based on an adaptive scale used in the invention mainly comprises the following steps:
(1) Calculating the resolution R:
Figure BDA0002240273900000081
where int(x) denotes the forward rounding function, N is the sample set capacity, d is the sample set dimension, and f_R is the resolution factor.
The average density of the cells is estimated by equation (11):
Figure BDA0002240273900000082
The above formula shows that the average cell density is not affected by the sample set capacity, so this method of setting the resolution is adaptive.
(2) Constructing the scale sequence
Figure BDA0002240273900000083
The method of constructing the scale sequence is not unique; two typical methods are given here. One is uniform division, in which the scale sequence generates equally spaced intervals; the resulting multidimensional grid space is called a uniform grid space. The other is density division, in which the intervals generated by the scale sequence depend on the distribution density of the samples: where the samples are denser, the intervals are smaller and more cells are generated; the resulting multidimensional grid space is called a density grid space.
(3) Generating a grid space based on the multidimensional sample set:
To avoid wasting storage space on a large number of empty cells, an algorithm is designed that generates the grid space directly from the multidimensional sample set. The grid space it generates contains no empty cells, which greatly improves computational efficiency and saves storage space, while the attribute values of the cells remain consistent with those of the original grid space. The pseudo-code of this algorithm is as follows:
Figure BDA0002240273900000084
Figure BDA0002240273900000091
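Because the pseudo-code survives only as an image, the following Python sketch illustrates one way a grid space without empty cells could be generated for the uniform division case. It is not the patent's own algorithm: the function name `build_grid`, the dictionary layout, and the choice of the per-dimension minimum and maximum as the bounds of the minimum bounded subset are assumptions made for illustration. The resolution is passed in as a parameter, since the formula for R is not reproduced in the text.

```python
from collections import defaultdict


def build_grid(points, resolution):
    """Map each point to a uniform-grid cell and keep only non-empty cells.

    Hypothetical helper: `points` is a list of equal-length tuples and
    `resolution` is the number of intervals per dimension (R).
    """
    dims = len(points[0])
    # Assumed bounds of the minimum bounded subset: per-dimension min and max.
    lows = [min(p[k] for p in points) for k in range(dims)]
    highs = [max(p[k] for p in points) for k in range(dims)]

    def cell_of(point):
        coords = []
        for k in range(dims):
            span = highs[k] - lows[k]
            # A degenerate dimension (all values equal) maps to interval 0.
            idx = 0 if span == 0 else int((point[k] - lows[k]) / span * resolution)
            # Intervals are left-closed, right-open, so the maximum value
            # would fall just outside; clamp it into the last interval.
            coords.append(min(idx, resolution - 1))
        return tuple(coords)

    members = defaultdict(list)
    for p in points:
        members[cell_of(p)].append(p)  # only visited cells are ever created

    # Each cell carries the three attributes named in the text.
    return {
        loc: {"location": loc, "member": pts, "density": len(pts)}
        for loc, pts in members.items()
    }


grid = build_grid([(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.6, 0.2)], resolution=4)
print(sorted(grid))
```

Storing cells in a dictionary keyed by coordinate means memory grows with the number of occupied cells rather than with R^d, which is the property the paragraph above asks for.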
Taking the grid division of a two-dimensional sample set as an example: before clustering, the sample set is first divided by the uniform division method, and the sample points in cells whose density is below the threshold are removed; the denoised sample set is then divided by the uniform division method or the density division method and clustered. First, a sample set D^2 is generated that follows a Gaussian distribution along the dimensions
Figure BDA0002240273900000092
as shown in FIG. 2(a). When the resolution factor f_R is 0.8, equation (11) gives a resolution R of 4, so a 4 × 4 two-dimensional grid space is generated; FIG. 2(b) shows the correspondence between cell numbers and cell coordinates. The result of dividing D^2 by the uniform division method is shown in FIG. 2(c), and the result of dividing D^2 by the density division method is shown in FIG. 2(d). Table 1 shows the grid division results for the two-dimensional sample set.
Figure BDA0002240273900000094
Figure BDA0002240273900000101
Denoising the data: noise can be filtered effectively by setting a reasonable density threshold. In this algorithm, the noise threshold threN of a multidimensional sample set D^d is calculated by the following formula:
where f_N is the noise coefficient and M_1 is the total number of non-empty cells. In general, f_N is set between 0 and 1; the more noise the sample set contains, the larger the value of f_N.
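The filtering step itself can be sketched as follows. The formula for threN is not reproduced in the text, so this sketch takes the threshold as an input (`thre_n`, a hypothetical parameter name) rather than computing it; it only shows that cells whose density falls below the threshold are discarded together with their sample points.

```python
def denoise(cells, thre_n):
    """Keep only the cells whose density reaches the noise threshold.

    `cells` maps a cell coordinate to the list of sample points in that cell;
    the density of a cell is the number of points it holds. `thre_n` is
    assumed to have been computed beforehand.
    """
    return {loc: pts for loc, pts in cells.items() if len(pts) >= thre_n}


cells = {
    (0, 0): [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1)],
    (3, 3): [(0.9, 0.9)],  # an isolated, sparse cell
}
print(denoise(cells, thre_n=2))
```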
Next, grid clustering is performed on the divided data: the denoised grid is processed with a halo threshold to divide its cells into halo cells and core cells; an adjacent grid operator is established for quickly finding the adjacent cells of a cell; and clustering is carried out in two steps, core cell clustering and halo cell division, in which all core cells are divided into several clusters by a traversal algorithm and the halo cells are assigned to existing clusters based on cell distance. The specific operations are as follows:
The denoised grid is processed as follows:
(1) sorting all cells by density in descending order;
(2) setting a halo threshold threH, defining cells whose density is below the threshold as halo cells and cells whose density is greater than or equal to the threshold as core cells, where the halo threshold is calculated by the following formula:
Figure BDA0002240273900000103
where f_H is the halo coefficient and M_2 is the total number of cells after denoising.
After this division, the halo cells identify the parts where the boundaries of different clusters are connected, which isolates the core cells of each cluster. The larger the halo coefficient, the more cells are classified as halo cells. The halo coefficient should therefore not be set too large; it should be chosen according to the degree of boundary connection among the clusters in the sample set. If the clusters in the sample set are boundary-connected, f_H is typically between 0.5 and 3.5; if they are not boundary-connected, f_H is set to 0.
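As an illustration of the two processing steps above, the sketch below sorts cells by descending density and splits them at the halo threshold. The threshold is again passed in as a value (`thre_h`, a hypothetical parameter name), because the formula for threH is not reproduced in the text.

```python
def split_core_halo(density, thre_h):
    """Sort cells by descending density, then split them at the halo threshold.

    `density` maps a cell coordinate to its density value. Cells below
    `thre_h` become halo cells; all others become core cells.
    """
    ordered = sorted(density, key=lambda loc: density[loc], reverse=True)
    core = [loc for loc in ordered if density[loc] >= thre_h]
    halo = [loc for loc in ordered if density[loc] < thre_h]
    return core, halo


density = {(0, 0): 9, (1, 1): 4, (2, 2): 6, (0, 1): 2}
core, halo = split_core_halo(density, thre_h=5)
print(core, halo)
```

With a threshold of 0 every cell is a core cell, which matches the remark that f_H is set to 0 when clusters are not boundary-connected, provided the threshold scales with f_H.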
The clustering process consists of traversing the core cells and processing the halo cells. In the first stage, the cells sorted by density are traversed in order to find the cores of all the clusters. The traversal follows these principles:
(1) each time a new cell is taken, determine whether it already belongs to an existing cluster; if it does not, define it as a new cluster; otherwise, move on to the next cell;
(2) for a newly defined cluster, first find, among the core cells, the adjacent cells of the cell that created the cluster and assign them to the cluster; then loop over the cells in the cluster and assign their adjacent core cells to the cluster, until no new cell is added.
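The two traversal principles amount to a flood fill over adjacent core cells. The sketch below is one possible reading, not the patent's own code: the function name `cluster_core_cells`, the integer cluster labels, and the use of a queue are assumptions. Adjacency is taken to mean that two cells differ by at most 1 in every coordinate.

```python
from collections import deque
from itertools import product


def cluster_core_cells(core_cells):
    """Group core cells into clusters by flood-filling over adjacent core cells.

    `core_cells` is a list of coordinate tuples sorted by descending density.
    Returns a dict mapping each core cell to an integer cluster label.
    """
    if not core_cells:
        return {}
    dims = len(core_cells[0])
    # All 3^d - 1 non-zero offsets in {-1, 0, 1}^d.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    core = set(core_cells)
    label = {}
    next_label = 0
    for seed in core_cells:
        if seed in label:            # principle (1): already in a cluster
            continue
        label[seed] = next_label     # otherwise the cell founds a new cluster
        queue = deque([seed])
        while queue:                 # principle (2): grow until nothing is added
            cell = queue.popleft()
            for off in offsets:
                nb = tuple(c + o for c, o in zip(cell, off))
                if nb in core and nb not in label:
                    label[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return label


print(cluster_core_cells([(0, 0), (0, 1), (1, 1), (5, 5)]))
```

Each core cell is labelled once and inspects 3^d - 1 offsets, so the work is proportional to the number of core cells rather than the number of sample points.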
In the second stage, each halo cell is assigned, on the nearest-neighbor principle, to the cluster that contains its nearest adjacent cell. To measure the distance between cells, the center of gravity of all data points in a cell is defined, and the distance between the centers of gravity of two cells is used as the distance between the cells. For any cell
Figure BDA0002240273900000111
the center of gravity is calculated as follows:
Figure BDA0002240273900000112
where, for the cell
Figure BDA0002240273900000113
n is the number of samples it contains.
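The formula image is not reproduced here. Assuming that the center of gravity is the coordinate-wise arithmetic mean of the n sample points in the cell, which is the usual meaning of the term, it could be computed as in this sketch (the function name is hypothetical):

```python
def center_of_gravity(points):
    """Return the coordinate-wise mean of the sample points in one cell.

    Assumption: the center of gravity is the arithmetic mean of the points.
    """
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[k] for p in points) / n for k in range(dims))


print(center_of_gravity([(0.0, 0.0), (2.0, 4.0)]))
```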
A halo cell falls into one of two cases: it either has adjacent cells or it has none. When adjacent cells exist, the distance between the halo cell and each adjacent cell is calculated. For any two cells
Figure BDA0002240273900000114
the similarity distance between them can be calculated by the following formula:
Figure BDA0002240273900000116
and the nearest adjacent cell is then found. Given the coordinates of a certain cell
Figure BDA0002240273900000117
consider, among the cells of G_D that are not adjacent to
Figure BDA0002240273900000119
the one with the minimum similarity distance to it. This non-adjacent cell
Figure BDA00022402739000001110
satisfies the following expression:
Figure BDA0002240273900000121
where j is an integer no greater than d; thus the similarity distance between
Figure BDA0002240273900000122
and
Figure BDA0002240273900000123
is 2.
The coordinates of the cells adjacent to a certain cell
Figure BDA0002240273900000124
are expressed as follows:
Figure BDA0002240273900000125
where the left-hand side denotes the adjacent cells of
Figure BDA0002240273900000127
(also called its adjacent grids), and Aopt^d is the d-dimensional adjacent grid operator:
Figure BDA0002240273900000128
where
Figure BDA0002240273900000129
is a d-dimensional coordinate vector, the vectors being arranged in ascending symmetric ternary order; T = 3^d is the number of all grids in the adjacent grid operator; T - 1 is the number of active grids in the adjacent grid operator; the symbol "-" denotes set difference; and 0 denotes the d-dimensional zero vector.
Taking a two-dimensional grid space as an example, the adjacent grid operator is expressed as follows:
Aopt^2 = {<-1,-1>, <-1,0>, <-1,1>, <0,-1>, <0,1>, <1,-1>, <1,0>, <1,1>} (24)
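The operator can be generated for any dimension d by enumerating every vector in {-1, 0, 1}^d and removing the zero vector, which leaves T - 1 = 3^d - 1 offsets. The sketch below does this with `itertools.product`, whose lexicographic output order over (-1, 0, 1) coincides with ascending order of the vectors read as symmetric ternary numbers. The function names are illustrative and do not come from the patent.

```python
from itertools import product


def adjacent_grid_operator(d):
    """Return the 3^d - 1 non-zero offset vectors of {-1, 0, 1}^d in ascending order."""
    return [v for v in product((-1, 0, 1), repeat=d) if any(v)]


def adjacent_cells(location):
    """Apply the operator to a cell coordinate to obtain its adjacent cell coordinates."""
    return [
        tuple(c + o for c, o in zip(location, offset))
        for offset in adjacent_grid_operator(len(location))
    ]


print(adjacent_grid_operator(2))
print(len(adjacent_grid_operator(3)))
```

For d = 2 this yields eight distinct offsets, and for d = 3 it yields 26. Because the operator grows as 3^d, looking up only those offsets that correspond to occupied cells keeps the search practical in higher dimensions.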
The similarity distances between the cell
Figure BDA00022402739000001210
and all of its adjacent cells are represented as a vector:
Figure BDA00022402739000001211
The cluster to which the nearest adjacent cell belongs is then determined, and the halo cell is assigned to that cluster. If a halo cell has no adjacent cell, it is defined as a new cluster. Once all halo cells have been processed, the clustering process is complete.
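A minimal sketch of this halo cell step follows. Several details are assumptions, since the similarity distance formula is not reproduced in the text: the distance is taken to be the Euclidean distance between centers of gravity, only adjacent cells that already carry a cluster label are considered, and halo cells are processed in the order given. The names `assign_halo_cells`, `labels` and `centers` are hypothetical.

```python
from itertools import product
from math import dist


def assign_halo_cells(halo_cells, labels, centers):
    """Assign each halo cell to the cluster of its nearest labelled adjacent cell.

    `labels` maps already clustered cells to cluster ids; `centers` maps every
    cell to its center of gravity. A halo cell without any labelled adjacent
    cell starts a new cluster. Returns an updated copy of `labels`.
    """
    labels = dict(labels)
    next_label = max(labels.values(), default=-1) + 1
    for cell in halo_cells:
        offsets = [o for o in product((-1, 0, 1), repeat=len(cell)) if any(o)]
        neighbours = [tuple(c + o for c, o in zip(cell, off)) for off in offsets]
        clustered = [nb for nb in neighbours if nb in labels]
        if clustered:
            # Assumed metric: Euclidean distance between centers of gravity.
            nearest = min(clustered, key=lambda nb: dist(centers[cell], centers[nb]))
            labels[cell] = labels[nearest]
        else:
            labels[cell] = next_label
            next_label += 1
    return labels


labels = {(0, 0): 0, (5, 5): 1}
centers = {(0, 0): (0.5, 0.5), (5, 5): (5.5, 5.5), (1, 1): (1.2, 1.2), (9, 9): (9.5, 9.5)}
print(assign_halo_cells([(1, 1), (9, 9)], labels, centers))
```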
When a sample set has a high dimension, the similarity distance between a cell and most of its adjacent cells does not meet the clustering requirement. Therefore, when a high-dimensional sample set is clustered, the similarity distance vector Adistance of the adjacent cells should be truncated, i.e. only the adjacent cells corresponding to the first 4 similarity distances in Adistance are considered.
Finally, clustering optimization is performed according to the data characteristics and user requirements: the clusters are checked against a merging threshold to determine whether clustering optimization is needed; if a cluster has fewer cells than the merging threshold, it is regarded as a minimal class and is merged based on the inter-class distance; labels are then assigned to the final clusters, and through them to the cells in each cluster and the data points in each cell, which completes the clustering process. The clustering method used by the invention can be a k-means clustering method, a DBSCAN clustering method, a CFSFDP clustering method, a WaveCluster clustering method, a CAGS clustering method and the like.
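The merging of minimal classes could look like the sketch below. The text does not define the inter-class distance, so this example assumes it is the Euclidean distance between the mean cell coordinates of two clusters; the function name `merge_minimal_clusters` and the parameter `thre_m` (the merging threshold) are likewise hypothetical.

```python
from math import dist


def merge_minimal_clusters(labels, thre_m):
    """Merge clusters with fewer than `thre_m` cells into the nearest larger cluster.

    `labels` maps a cell coordinate to a cluster id. Assumption: the inter-class
    distance is the distance between the mean cell coordinates of two clusters.
    """
    clusters = {}
    for cell, lab in labels.items():
        clusters.setdefault(lab, []).append(cell)

    def centre(cells):
        return tuple(sum(c[k] for c in cells) / len(cells) for k in range(len(cells[0])))

    large = {lab: centre(cells) for lab, cells in clusters.items() if len(cells) >= thre_m}
    if not large:  # nothing to merge into; leave the labelling unchanged
        return dict(labels)

    merged = {}
    for lab, cells in clusters.items():
        target = lab
        if lab not in large:  # a minimal class: find the closest larger cluster
            here = centre(cells)
            target = min(large, key=lambda other: dist(here, large[other]))
        for cell in cells:
            merged[cell] = target
    return merged


labels = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (2, 2): 1, (9, 9): 2, (9, 8): 2, (8, 9): 2}
print(merge_minimal_clusters(labels, thre_m=2))
```

Once the final cell labels are fixed, every data point simply inherits the label of the cell it belongs to.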

Claims (5)

1. A clustering method based on an adjacent grid search strategy, characterized by comprising the following steps:
step 1, performing grid division on the original data: dividing the original data set into a finite number of cells using a multidimensional space grid, on the assumption that data points in the same cell belong to the same cluster; supporting statistical analysis of the cells by setting three attributes for each cell, namely member, density and position; and checking the grid against a noise threshold to determine whether the data set contains noise data, performing denoising if it does;
step 2, performing grid clustering on the divided data: processing the denoised grid with a halo threshold to divide its cells into halo cells and core cells; establishing an adjacent grid operator for quickly finding the adjacent cells of a cell; and carrying out clustering in two steps, core cell clustering and halo cell division, in which all core cells are divided into several clusters by a traversal algorithm and the halo cells are assigned to existing clusters based on cell distance;
step 3, performing clustering optimization according to the data characteristics and user requirements: checking the clusters against a merging threshold to determine whether clustering optimization is needed; if a cluster has fewer cells than the merging threshold, it is regarded as a minimal class and is merged based on the inter-class distance; labels are then assigned to the final clusters, and through them to the cells in each cluster and the data points in each cell, which completes the clustering process.
2. The method of claim 1, wherein the partitioning of the original data set using the multi-dimensional space grid is performed by adaptive scaling of the multi-dimensional space S dIs divided in any dimension, and an adaptive scale sequence is constructed
Figure FDA0002240273890000011
The following two cases are distinguished:
for a uniform grid space, an infinite monotonic series of scales is represented as follows:
Figure FDA0002240273890000012
for the density grid space, the following process is performed:
first, the resolution R is calculated:
Figure FDA0002240273890000013
wherein int (x) represents a forward rounding function, N is the sample set capacity, d is the sample set dimension, f RIs a resolution factor;
then, when the object to be divided is the smallest bounded space S_D of the multi-dimensional sample set D^d, the scale sequence
Figure FDA0002240273890000021
has a finite length; let the scale sequence be expressed as:
Figure FDA0002240273890000022
wherein d is the total dimension of the multidimensional space, i is any dimension of the multidimensional space, and R is the resolution.
3. The adjacent grid search strategy-based clustering method of claim 1, wherein the adjacent grid operator Aopt^d is expressed as follows:
Figure FDA0002240273890000023
wherein
Figure FDA0002240273890000024
is a d-dimensional coordinate vector arranged in ascending symmetric ternary order, T = 3^d is the number of all grids in the adjacent grid operator, the symbol "-" represents the set difference, and 0 represents the d-dimensional zero vector.
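As a hedged illustration of claim 3, the adjacent grid operator can be generated by enumerating all 3^d coordinate vectors with components in {-1, 0, 1} and removing the zero vector. The function names below are hypothetical, and treating the first coordinate as the most significant symmetric ternary digit is an assumption.

```python
from itertools import product


def adjacent_grid_operator(d):
    """Return the 3**d - 1 offset vectors pointing to the cells adjacent to a cell."""
    # product yields the 3**d vectors in lexicographic order, which is ascending
    # symmetric (balanced) ternary order if the first digit is most significant.
    all_vectors = product((-1, 0, 1), repeat=d)
    # Set difference with the d-dimensional zero vector.
    return [v for v in all_vectors if any(v)]


def adjacent_cells(cell, occupied):
    """Return the cells in `occupied` (a set of integer tuples) adjacent to `cell`."""
    found = []
    for offset in adjacent_grid_operator(len(cell)):
        candidate = tuple(c + o for c, o in zip(cell, offset))
        if candidate in occupied:
            found.append(candidate)
    return found
```

Because the operator depends only on d, it can be computed once and reused for every cell, so each neighbor lookup costs at most 3^d - 1 set membership checks regardless of the number of cells.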
4. The adjacent grid search strategy-based clustering method according to claim 1, wherein the noise threshold threN is calculated by the following formula:
Figure FDA0002240273890000025
wherein f_N is the noise coefficient and M_1 is the total number of all non-empty cells; the more noise the sample set contains, the larger the value of f_N.
5. The method of claim 1, wherein the cell distance is the similarity distance between any cell
Figure FDA0002240273890000026
and all of its neighboring cells, expressed as a vector:
Figure FDA0002240273890000027
wherein d is the total number of dimensions of the multidimensional space.
CN201910997760.7A 2019-10-21 2019-10-21 Clustering method based on adjacent grid search Pending CN110781943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910997760.7A CN110781943A (en) 2019-10-21 2019-10-21 Clustering method based on adjacent grid search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910997760.7A CN110781943A (en) 2019-10-21 2019-10-21 Clustering method based on adjacent grid search

Publications (1)

Publication Number Publication Date
CN110781943A true CN110781943A (en) 2020-02-11

Family

ID=69386053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910997760.7A Pending CN110781943A (en) 2019-10-21 2019-10-21 Clustering method based on adjacent grid search

Country Status (1)

Country Link
CN (1) CN110781943A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131937A (en) * 2020-08-14 2020-12-25 中翰盛泰生物技术股份有限公司 Automatic grouping method of fluorescent microspheres
CN113361411A (en) * 2021-06-07 2021-09-07 国网新疆电力有限公司哈密供电公司 Random pulse interference signal elimination method based on grid and density clustering algorithm
CN115795520A (en) * 2023-02-07 2023-03-14 济南霍兹信息科技有限公司 Data management method for computer system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131937A (en) * 2020-08-14 2020-12-25 中翰盛泰生物技术股份有限公司 Automatic grouping method of fluorescent microspheres
CN113361411A (en) * 2021-06-07 2021-09-07 国网新疆电力有限公司哈密供电公司 Random pulse interference signal elimination method based on grid and density clustering algorithm
CN115795520A (en) * 2023-02-07 2023-03-14 济南霍兹信息科技有限公司 Data management method for computer system
CN115795520B (en) * 2023-02-07 2023-04-21 济南霍兹信息科技有限公司 Data management method for computer system

Similar Documents

Publication Publication Date Title
KR101003842B1 (en) Method and system of clustering for multi-dimensional data streams
Ozkok et al. International Journal of Intelligent Systems and Applications in Engineering
CN110781943A (en) Clustering method based on adjacent grid search
CN106845536B (en) Parallel clustering method based on image scaling
Li et al. Local gap density for clustering high-dimensional data with varying densities
Ashabi et al. The systematic review of K-means clustering algorithm
Abbas et al. Cmune: A clustering using mutual nearest neighbors algorithm
CN115294378A (en) Image clustering method and system
Starczewski et al. A novel grid-based clustering algorithm
Mu et al. DBSCAN-KNN-GA: a multi Density-Level Parameter-Free clustering algorithm
Hershberger et al. Adaptive sampling for geometric problems over data streams
Cheng et al. A local cores-based hierarchical clustering algorithm for data sets with complex structures
Tsai et al. GF-DBSCAN; a new efficient and effective data clustering technique for large databases
Wang et al. Robust clustering with topological graph partition
Zheng Improved K-means clustering algorithm based on dynamic clustering
Yan et al. Density-based Clustering using Automatic Density Peak Detection.
CN115510959A (en) Density peak value clustering method based on natural nearest neighbor and multi-cluster combination
CN114359632A (en) Point cloud target classification method based on improved PointNet + + neural network
Ali et al. Subject review: text clustering algorithms
Zhang et al. A new outlier detection algorithm based on fast density peak clustering outlier factor.
Lu et al. Dynamic Partition Forest: An Efficient and Distributed Indexing Scheme for Similarity Search based on Hashing
Li et al. An efficient clustering method for dbscan geographic spatio-temporal large data with improved parameter optimization
Wang et al. CUBN: A clustering algorithm based on density and distance
CN111062418A (en) Non-parametric clustering algorithm and system based on minimum spanning tree
Luan et al. Density peaks spatial clustering by grid neighborhood search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200211