CN117113117A - Density peak clustering method for self-adaptive scale grid and diffusion intensity - Google Patents

Density peak clustering method for self-adaptive scale grid and diffusion intensity

Publication number
CN117113117A
Authority
CN
China
Prior art keywords
grid
cluster
density
grids
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311166189.7A
Other languages
Chinese (zh)
Inventor
王玥洋
佘堃
刘书舟
于钥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-09-11
Filing date
2023-09-11
Publication date
2023-11-24
Application filed by University of Electronic Science and Technology of China
Priority to CN202311166189.7A
Publication of CN117113117A
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention relates to the technical field of data clustering, in particular to a density peak clustering method of self-adaptive scale grids and diffusion intensity. The original data points are mapped into a grid space T_g by adaptive-scale grid division, the density of each grid is calculated, the grids are divided into dense grids and sparse grids according to a grid density threshold, and the grid cells are treated as the clustering objects in the subsequent steps; the grids are divided according to density fluctuation and connectivity is calculated; the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and initial clusters are divided; the density peaks of the inter-cluster edge grids are searched; multiple clusters are merged; and the grid clustering result is mapped back to the original dataset. In this way, higher-dimensional data can be clustered.

Description

Density peak clustering method for self-adaptive scale grid and diffusion intensity
Technical Field
The invention relates to the technical field of data clustering, in particular to a density peak clustering method of self-adaptive scale grids and diffusion intensity.
Background
Cluster analysis is an unsupervised learning method that divides data into clusters according to specific criteria in order to explore the information implicit in the data. As a data analysis method, it is widely applied in fields such as data analysis, image processing, bioinformatics, pattern recognition, and machine learning. At present, the density peak clustering (DPC) algorithm is commonly used to cluster datasets, but it cannot handle the clustering of higher-dimensional data.
Disclosure of Invention
The invention aims to provide a density peak clustering method of self-adaptive scale grids and diffusion intensity that is capable of clustering higher-dimensional data.
To achieve the above purpose, the density peak clustering method of self-adaptive scale grids and diffusion intensity adopted by the invention comprises the following steps:
step 1, mapping the original data points into a grid space T_g by adaptive-scale grid division, calculating the density of each grid, dividing the grids into dense grids and sparse grids according to a grid density threshold, and treating the grid cells as the clustering objects in the subsequent steps;
step 2, dividing grids according to density fluctuation and calculating connectivity;
step 3, calculating the relative distance and the diffusion intensity, screening the assumed center and dividing the initial cluster;
step 4, searching density peaks of inter-cluster edge grids;
step 5, multi-cluster merging;
step 6, mapping the grid clustering result to the original dataset.
In step 2, in the step of dividing the grids according to density fluctuation and calculating connectivity:
grids with ρ_g < θ are marked as sparse grids and all other grids as dense grids, and the connectivity h_g of each dense grid and the global number of connected components ω are calculated.
In step 3, in the step of calculating the relative distance and the diffusion intensity, screening the hypothetical centers and dividing the initial clusters:
the relative distance δ_i of each grid ob_i is computed using the Chebyshev distance in the d-dimensional space, where ob_ip is the p-th coordinate of the grid:
after the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if γ_i > γ_j for every grid j (j ≠ i) within the radius δ_i around grid i, grid i is a hypothetical center; each non-center grid is then assigned to the cluster represented by a center, following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids have been assigned.
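The formula for δ_i is not reproduced in this text, so the following is only a minimal sketch under stated assumptions: it uses the standard Chebyshev distance (the largest coordinate difference) and assumes the usual DPC convention that δ_i is the distance to the nearest grid of higher density, with the densest grid receiving its largest distance to any other grid. The function names `chebyshev` and `relative_distances` are illustrative and do not come from the patent.

```python
import numpy as np

def chebyshev(a, b):
    """Chebyshev (L-infinity) distance: the largest per-dimension difference."""
    return int(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def relative_distances(coords, rho):
    """delta[i] = Chebyshev distance from grid i to the nearest denser grid.

    Assumption (DPC convention, not confirmed by the patent text): a grid
    with no denser grid gets its largest distance to any other grid.
    """
    coords = np.asarray(coords)
    rho = np.asarray(rho)
    # Pairwise Chebyshev distances between integer grid coordinates.
    dist = np.max(np.abs(coords[:, None, :] - coords[None, :, :]), axis=2)
    delta = np.empty(len(coords), dtype=dist.dtype)
    for i in range(len(coords)):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return delta

# Three grids in a 2-D grid space with densities 5, 3 and 4.
delta = relative_distances([[0, 0], [1, 0], [4, 3]], [5, 3, 4])
```

Because the Chebyshev distance between grid cells is an integer number of cells, the radius δ_i used in the center screening can be compared exactly, without floating-point tolerance.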
In step 4, in the step of searching for the density peaks of the inter-cluster edge grids, the edge grids between clusters are found according to the following conditions:
a grid i in cluster c has, within its δ_i range, a grid j belonging to another cluster c′;
i is the grid in cluster c that is nearest to grid j;
the edge grid density peak between a pair of clusters c and c′ is denoted ρ_cc′.
In step 5, in the step of multi-cluster merging:
whether clusters need to be merged is judged from the density peaks of the inter-cluster edge grids; a higher edge grid density peak indicates a higher similarity between the two clusters; if the density peak of the cluster c containing the edge grid lies within a certain density fluctuation range, cluster c is considered for merging into the adjacent cluster c′, that is, when the following formula is satisfied:
this confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into their nearest cluster c′.
In step 6, in the step of mapping the grid clustering result to the original dataset:
each data point is labeled with the cluster of the grid cell that contains it; a lookup table between the data and the grid space is established, and the cluster labels in T_g are recorded in the data clustering result table.
In the density peak clustering method of self-adaptive scale grids and diffusion intensity of the invention, the original data points are mapped into a grid space T_g by adaptive-scale grid division, the density of each grid is calculated, the grids are divided into dense grids and sparse grids according to a grid density threshold, and the grid cells are treated as the clustering objects in the subsequent steps; the grids are divided according to density fluctuation and connectivity is calculated; the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and initial clusters are divided; the density peaks of the inter-cluster edge grids are searched; multiple clusters are merged; and the grid clustering result is mapped back to the original dataset. The method maps the dataset to a grid space by adaptive-scale grid partitioning; defines an adaptive division scale and a density fluctuation formula, and replaces the local density of the contained data points with the density of the single grid cell; defines a diffusion intensity for calculating grid centrality; and designs a center-grid screening scheme and an allocation strategy to obtain the clustering result. Compared with the original DPC algorithm, owing to the advantages of grid partitioning and the new cluster allocation strategy, the method can cluster higher-dimensional data without presetting parameters such as a cutoff distance; the time complexity is reduced and clustering is faster; and the high-dimensional failure of the Euclidean distance used by the original algorithm is avoided.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the steps of the density peak clustering method of the present invention for adaptive scale grid and diffusion strength.
Detailed Description
Referring to fig. 1, the invention provides a density peak clustering method of self-adaptive scale grids and diffusion intensity, which comprises the following steps:
step 1: mapping the original data points into a grid space T_g by adaptive-scale grid division, calculating the density of each grid, dividing the grids into dense grids and sparse grids according to a grid density threshold, and treating the grid cells as the clustering objects in the subsequent steps;
step 2: dividing grids according to density fluctuation and calculating connectivity;
step 3: calculating the relative distance and the diffusion intensity, screening the assumed center and dividing the initial cluster;
step 4: searching density peaks of inter-cluster edge grids;
step 5: combining multiple clusters;
step 6: mapping the grid clustering result to the original dataset.
In the present embodiment, the original data points are mapped into a grid space T_g by adaptive-scale grid division, the density of each grid is calculated, the grids are divided into dense grids and sparse grids according to a grid density threshold, and the grid cells are treated as the clustering objects in the subsequent steps; the grids are divided according to density fluctuation and connectivity is calculated; the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and initial clusters are divided; the density peaks of the inter-cluster edge grids are searched; multiple clusters are merged; and the grid clustering result is mapped back to the original dataset. The method maps the dataset to a grid space by adaptive-scale grid partitioning; defines an adaptive division scale and a density fluctuation formula, and replaces the local density of the contained data points with the density of the single grid cell; defines a diffusion intensity for calculating grid centrality; and designs a center-grid screening scheme and an allocation strategy to obtain the clustering result. Compared with the original DPC algorithm, owing to the advantages of grid partitioning and the new cluster allocation strategy, the method can cluster higher-dimensional data without presetting parameters such as a cutoff distance; the time complexity is reduced and clustering is faster; and the high-dimensional failure of the Euclidean distance used by the original algorithm is avoided.
Further, in step 2, in the step of dividing the grids according to density fluctuation and calculating connectivity:
grids with ρ_g < θ are marked as sparse grids and all other grids as dense grids, and the connectivity h_g of each dense grid and the global number of connected components ω are calculated.
Further, in step 3, in the step of calculating the relative distance and the diffusion intensity, screening the hypothetical centers and dividing the initial clusters:
the relative distance δ_i of each grid ob_i is computed using the Chebyshev distance in the d-dimensional space, where ob_ip is the p-th coordinate of the grid:
after the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if γ_i > γ_j for every grid j (j ≠ i) within the radius δ_i around grid i, grid i is a hypothetical center; each non-center grid is then assigned to the cluster represented by a center, following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids have been assigned.
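The screening and allocation rule above can be sketched as follows. This is one possible reading rather than the patent's definitive procedure: the radius test is taken as inclusive (Chebyshev distance ≤ δ_i), and a non-center grid inherits the label of its nearest already-labeled grid, with ties broken by the higher γ. The function name `screen_and_assign` and the example values of γ and δ are hypothetical inputs chosen for illustration; the patent's formulas for γ and δ are not used here.

```python
import numpy as np

def screen_and_assign(coords, gamma, delta):
    """Pick hypothetical centers, then assign the remaining grids.

    coords: integer grid coordinates, gamma: centrality, delta: relative
    distance. Returns (list of center indices, array of cluster labels).
    """
    coords = np.asarray(coords)
    gamma = np.asarray(gamma, dtype=float)
    delta = np.asarray(delta)
    dist = np.max(np.abs(coords[:, None, :] - coords[None, :, :]), axis=2)
    order = np.argsort(-gamma, kind="stable")  # descending centrality
    labels = np.full(len(coords), -1)
    centers = []
    for i in order:
        near = dist[i] <= delta[i]  # assumption: inclusive radius
        near[i] = False
        if np.all(gamma[near] < gamma[i]):
            labels[i] = len(centers)
            centers.append(int(i))
    for i in order:
        if labels[i] != -1:
            continue
        done = np.flatnonzero(labels != -1)
        nearest = done[dist[i, done] == dist[i, done].min()]
        # Among equally near labeled grids, follow the one with highest gamma.
        labels[i] = labels[nearest[np.argmax(gamma[nearest])]]
    return centers, labels

centers, labels = screen_and_assign(
    coords=[[0], [1], [2], [7], [8]],
    gamma=[2.0, 9.0, 3.0, 8.0, 1.0],  # hypothetical centrality values
    delta=[1, 7, 1, 1, 1],            # hypothetical relative distances
)
```

Processing grids in descending γ order means each non-center grid can rely on higher-centrality grids already having a label, so one pass suffices.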
Further, in step 4, in the step of searching for the density peaks of the inter-cluster edge grids, the edge grids between clusters are found according to the following conditions:
a grid i in cluster c has, within its δ_i range, a grid j belonging to another cluster c′;
i is the grid in cluster c that is nearest to grid j;
the edge grid density peak between a pair of clusters c and c′ is denoted ρ_cc′.
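The two edge-grid conditions can be checked directly on the grid coordinates. The sketch below assumes Chebyshev distance between grids, treats the pair (c, c′) as unordered, and takes ρ_cc′ to be the highest density among the edge grids found on either side; these choices and the name `edge_density_peaks` are assumptions made for illustration.

```python
import numpy as np

def edge_density_peaks(coords, rho, delta, labels):
    """Return {(c, c2): peak density of the edge grids between c and c2}."""
    coords = np.asarray(coords)
    labels = np.asarray(labels)
    dist = np.max(np.abs(coords[:, None, :] - coords[None, :, :]), axis=2)
    peaks = {}
    n = len(coords)
    for i in range(n):
        c = labels[i]
        same = np.flatnonzero(labels == c)
        for j in range(n):
            # Condition 1: j is in another cluster and within delta_i of i.
            if labels[j] == c or dist[i, j] > delta[i]:
                continue
            # Condition 2: no grid of cluster c is closer to j than i is.
            if dist[i, j] > dist[same, j].min():
                continue
            key = tuple(sorted((int(c), int(labels[j]))))
            peaks[key] = max(peaks.get(key, 0), rho[i])
    return peaks

# Five grids on a line: cluster 0 is {0, 1, 2}, cluster 1 is {3, 4}.
peaks = edge_density_peaks(
    coords=[[0], [1], [2], [3], [4]],
    rho=[5, 3, 2, 4, 6],
    delta=[4, 1, 1, 1, 4],
    labels=[0, 0, 0, 1, 1],
)
```

In this example the edge grids are grid 2 (density 2) and grid 3 (density 4), so the peak recorded for the pair is 4.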
Further, in step 5, in the step of multi-cluster merging:
whether clusters need to be merged is judged from the density peaks of the inter-cluster edge grids; a higher edge grid density peak indicates a higher similarity between the two clusters; if the density peak of the cluster c containing the edge grid lies within a certain density fluctuation range, cluster c is considered for merging into the adjacent cluster c′, that is, when the following formula is satisfied:
this confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into their nearest cluster c′.
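The merging criterion itself is not reproduced in this text, so the sketch below only illustrates the control flow: cluster pairs are visited in descending order of ρ_cc′, and a caller-supplied predicate decides whether to merge. The predicate `should_merge` is a hypothetical stand-in for the patent's formula, and the union-find bookkeeping is an implementation choice, not something the patent specifies.

```python
def merge_clusters(peaks, should_merge):
    """Merge clusters pairwise, visiting pairs from high to low rho_cc.

    peaks: {(c, c2): rho_cc}. should_merge(c, c2, rho_cc) -> bool is a
    hypothetical placeholder for the patent's merging criterion.
    Returns {cluster: representative cluster after merging}.
    """
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for (c, c2), rho_cc in sorted(peaks.items(), key=lambda kv: -kv[1]):
        root_c, root_c2 = find(c), find(c2)
        if root_c != root_c2 and should_merge(c, c2, rho_cc):
            parent[root_c] = root_c2  # merge c into c2
    return {c: find(c) for c in list(parent)}

# Illustrative criterion only: merge when the edge peak is at least 3.
merged = merge_clusters({(0, 1): 4, (1, 2): 1}, lambda c, c2, r: r >= 3)
```

With the illustrative threshold, clusters 0 and 1 are merged while cluster 2 stays separate.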
Further, in step 6, in the step of mapping the grid clustering result to the original dataset:
each data point is labeled with the cluster of the grid cell that contains it; a lookup table between the data and the grid space is established, and the cluster labels in T_g are recorded in the data clustering result table.
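A dictionary keyed by grid cell is one simple way to realize the lookup table. In the sketch below, points whose cell carries no cluster label (for example, cells that were classed as sparse) receive a noise label of -1; that handling, like the name `label_points`, is an assumption, since the text does not say what happens to such points.

```python
def label_points(point_cells, grid_labels, noise=-1):
    """Give each data point the cluster label of the grid cell it falls in.

    point_cells: grid cell index of every data point, in data order.
    grid_labels: {cell: cluster label}, the lookup table for T_g.
    """
    return [grid_labels.get(tuple(cell), noise) for cell in point_cells]

result = label_points(
    point_cells=[(0, 0), (0, 0), (2, 1), (5, 5)],
    grid_labels={(0, 0): 0, (2, 1): 1},
)
```

The lookup costs constant time per point, so this final step is linear in the number of data points.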
In the present embodiment, the following definitions are set:
Adaptive scale grid: the dataset is treated as a d-dimensional feature space T = A_1 × A_2 × ... × A_d, and N is the number of samples in the dataset. Each dimension A_i (i = 1, 2, ..., d) of the feature space is divided according to an adaptive scale m, which partitions the feature space into a grid space T_g. The calculation formula of m is as follows:
Grid density: the number of data points in each grid cell is denoted ρ_g. Each grid cell ob_i is treated as a set of data points; all data points in the set share the same local density, which is the density of the grid cell, ρ_i = ρ_{g_i}.
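A minimal sketch of the grid mapping and density count, assuming each dimension is split into m equal-width intervals. The formula for the adaptive scale m is not reproduced in this text, so m is passed in as a plain parameter here, and the name `grid_density` is illustrative.

```python
import numpy as np
from collections import Counter

def grid_density(X, m):
    """Map points to grid cells and count the points in each occupied cell.

    Assumption: every dimension is cut into m equal-width intervals.
    Returns (cell index of each point, Counter mapping cell -> density).
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Guard against zero-width dimensions to avoid division by zero.
    width = np.where(hi > lo, (hi - lo) / m, 1.0)
    # Points on the upper boundary belong to the last interval.
    cells = np.minimum(((X - lo) / width).astype(int), m - 1)
    cell_keys = [tuple(int(v) for v in c) for c in cells]
    return cell_keys, Counter(cell_keys)

cell_keys, density = grid_density(
    [[0, 0], [0.1, 0.2], [1, 1], [0.9, 0.8], [0.5, 0.5]], m=2
)
```

Only occupied cells are stored, so the memory cost grows with the number of data points rather than with m^d.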
Global density fluctuation value: the global density fluctuation is calculated from the density-peak grid and the density-minimum grid according to the following formula:
θ = ε(ρ_max − ρ_min)
where the empirical parameter ε is typically set to 1/4;
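The threshold can be computed in one line once the grid densities are known. The sketch below reads the formula as θ = ε(ρ_max − ρ_min); the operator between the two densities is not legible in this text, so the subtraction is an assumption, as is the name `split_by_density`.

```python
def split_by_density(rho, eps=0.25):
    """Split grid cells into dense and sparse sets using theta.

    rho: {cell: density}. Assumes theta = eps * (rho_max - rho_min).
    Cells with density below theta are sparse; all others are dense.
    """
    theta = eps * (max(rho.values()) - min(rho.values()))
    dense = {g for g, r in rho.items() if r >= theta}
    sparse = {g for g, r in rho.items() if r < theta}
    return theta, dense, sparse

theta, dense, sparse = split_by_density(
    {(0, 0): 1, (0, 1): 9, (1, 1): 4, (2, 2): 2}
)
```

Here the densities range from 1 to 9, so θ is 2.0 and only the cell with a single point is classed as sparse.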
Grid connectivity h_g: the number of grids that a grid can reach through adjacent grids, that is, the size of the connected component in which the grid is located;
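Connectivity can be computed with a breadth-first search over the dense cells. The sketch below assumes two cells are adjacent when their indices differ by at most 1 in every dimension (Chebyshev distance 1); the text does not state the adjacency rule, and the name `connectivity` is illustrative.

```python
from collections import deque
from itertools import product

def connectivity(dense_cells):
    """Return ({cell: h_g}, omega) for a collection of dense grid cells.

    h_g is the size of the connected component containing the cell and
    omega is the number of connected components.
    """
    dense = set(dense_cells)
    h, omega = {}, 0
    for start in dense:
        if start in h:
            continue
        omega += 1
        component, queue, seen = [start], deque([start]), {start}
        while queue:
            cell = queue.popleft()
            # Visit every neighbor offset in {-1, 0, 1}^d.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in seen:
                    seen.add(neighbor)
                    component.append(neighbor)
                    queue.append(neighbor)
        for cell in component:
            h[cell] = len(component)
    return h, omega

h, omega = connectivity({(0, 0), (0, 1), (1, 1), (5, 5)})
```

Enumerating all 3^d neighbor offsets becomes expensive as d grows; for high-dimensional data, comparing each cell against the other occupied cells may be cheaper.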
Diffusion intensity: the diffusibility of a grid is related to the size h_g of the connected component in which it is located and to the total number of connected components ω; the diffusion intensity of each grid is calculated as follows:
Centrality: the weight of a grid as a candidate cluster center, denoted γ; the larger the γ value, the more likely the grid is to be a cluster center. The centrality of each grid is calculated based on two assumptions:
1) A density-peak grid should have a higher density than the surrounding grids and, at the same time, be far from the nearest grid that is denser than itself;
2) Grids in a connected component with a large connectivity are unlikely to become center grids even if their density is high, because several density-peak grids may exist in a high-density region of a cluster;
the formula for the γ value follows from these assumptions:
γ i =ρ i δ ii
Based on the above definitions:
step 1: the original data points are mapped into a grid space T_g by adaptive-scale grid division, the density of each grid is calculated, the grids are divided into dense grids and sparse grids according to a grid density threshold, and the grid cells are treated as the clustering objects in the subsequent steps;
step 2: the grids are divided according to density fluctuation and connectivity is calculated: grids with ρ_g < θ are marked as sparse grids and all other grids as dense grids, and the connectivity h_g of each dense grid and the global number of connected components ω are calculated;
step 3: the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and initial clusters are divided: the relative distance δ_i of each grid ob_i is computed using the Chebyshev distance in the d-dimensional space, where ob_ip is the p-th coordinate of the grid:
after the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if γ_i > γ_j for every grid j (j ≠ i) within the radius δ_i around grid i, grid i is a hypothetical center; each non-center grid is then assigned to the cluster represented by a center, following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids have been assigned;
step 4: the density peaks of the inter-cluster edge grids are searched: a grid i in cluster c has, within its δ_i range, a grid j belonging to another cluster c′; i is the grid in cluster c that is nearest to grid j; the edge grid density peak between a pair of clusters c and c′ is denoted ρ_cc′;
step 5: multiple clusters are merged: whether clusters need to be merged is judged from the density peaks of the inter-cluster edge grids; a higher edge grid density peak indicates a higher similarity between the two clusters; if the density peak of the cluster c containing the edge grid lies within a certain density fluctuation range, cluster c is considered for merging into the adjacent cluster c′, that is, when the following formula is satisfied:
this confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into their nearest cluster c′;
step 6: the grid clustering result is mapped to the original dataset: each data point is labeled with the cluster of the grid cell that contains it, a lookup table between the data and the grid space is established, and the cluster labels in T_g are recorded in the data clustering result table.
The method maps the dataset to a grid space by adaptive-scale grid partitioning; defines an adaptive division scale and a density fluctuation formula, and replaces the local density of the contained data points with the density of the single grid cell; defines a diffusion intensity for calculating grid centrality; and designs a center-grid screening scheme and an allocation strategy to obtain the clustering result. Compared with the original DPC algorithm, owing to the advantages of grid partitioning and the new cluster allocation strategy, the method can cluster higher-dimensional data without presetting parameters such as a cutoff distance; the time complexity is reduced and clustering is faster; and the high-dimensional failure of the Euclidean distance used by the original algorithm is avoided.
The above disclosure is only a preferred embodiment of the present invention and does not limit the scope of the invention. Those skilled in the art will understand that implementations of all or part of the procedures of the above embodiment, and equivalent changes made according to the claims, still fall within the scope of the present invention.

Claims (6)

1. A density peak clustering method of self-adaptive scale grids and diffusion intensity, characterized by comprising the following steps:
step 1, mapping the original data points into a grid space T_g by adaptive-scale grid division, calculating the density of each grid, dividing the grids into dense grids and sparse grids according to a grid density threshold, and treating the grid cells as the clustering objects in the subsequent steps;
step 2, dividing grids according to density fluctuation and calculating connectivity;
step 3, calculating the relative distance and the diffusion intensity, screening the assumed center and dividing the initial cluster;
step 4, searching density peaks of inter-cluster edge grids;
step 5, multi-cluster merging;
step 6, mapping the grid clustering result to the original dataset.
2. The density peak clustering method of self-adaptive scale grids and diffusion intensity according to claim 1, characterized in that, in the step of dividing the grids according to density fluctuation and calculating connectivity in step 2:
grids with ρ_g < θ are marked as sparse grids and all other grids as dense grids, and the connectivity h_g of each dense grid and the global number of connected components ω are calculated.
3. The density peak clustering method of self-adaptive scale grids and diffusion intensity according to claim 1, characterized in that, in the step of calculating the relative distance and the diffusion intensity, screening the hypothetical centers and dividing the initial clusters in step 3:
the relative distance δ_i of each grid ob_i is computed using the Chebyshev distance in the d-dimensional space, where ob_ip is the p-th coordinate of the grid:
after the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if γ_i > γ_j for every grid j (j ≠ i) within the radius δ_i around grid i, grid i is a hypothetical center; each non-center grid is then assigned to the cluster represented by a center, following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids have been assigned.
4. The density peak clustering method of self-adaptive scale grids and diffusion intensity according to claim 1, characterized in that, in the step of searching for the density peaks of the inter-cluster edge grids in step 4, the edge grids between clusters are found according to the following conditions:
a grid i in cluster c has, within its δ_i range, a grid j belonging to another cluster c′;
i is the grid in cluster c that is nearest to grid j;
the edge grid density peak between a pair of clusters c and c′ is denoted ρ_cc′.
5. The density peak clustering method of self-adaptive scale grids and diffusion intensity according to claim 1, characterized in that, in the step of multi-cluster merging in step 5:
whether clusters need to be merged is judged from the density peaks of the inter-cluster edge grids; a higher edge grid density peak indicates a higher similarity between the two clusters; if the density peak of the cluster c containing the edge grid lies within a certain density fluctuation range, cluster c is considered for merging into the adjacent cluster c′, that is, when the following formula is satisfied:
this confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into their nearest cluster c′.
6. The density peak clustering method of self-adaptive scale grids and diffusion intensity according to claim 1, characterized in that, in the step of mapping the grid clustering result to the original dataset in step 6:
each data point is labeled with the cluster of the grid cell that contains it; a lookup table between the data and the grid space is established, and the cluster labels in T_g are recorded in the data clustering result table.
CN202311166189.7A 2023-09-11 2023-09-11 Density peak clustering method for self-adaptive scale grid and diffusion intensity Pending CN117113117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311166189.7A CN117113117A (en) 2023-09-11 2023-09-11 Density peak clustering method for self-adaptive scale grid and diffusion intensity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311166189.7A CN117113117A (en) 2023-09-11 2023-09-11 Density peak clustering method for self-adaptive scale grid and diffusion intensity

Publications (1)

Publication Number Publication Date
CN117113117A true CN117113117A (en) 2023-11-24

Family

ID=88796358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311166189.7A Pending CN117113117A (en) 2023-09-11 2023-09-11 Density peak clustering method for self-adaptive scale grid and diffusion intensity

Country Status (1)

Country Link
CN (1) CN117113117A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117459418A (en) * 2023-12-25 2024-01-26 天津神州海创科技有限公司 Real-time data acquisition and storage method and system
CN117459418B (en) * 2023-12-25 2024-03-08 天津神州海创科技有限公司 Real-time data acquisition and storage method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination