CN117113117A

CN117113117A - Density peak clustering method for self-adaptive scale grid and diffusion intensity

Info

Publication number: CN117113117A
Application number: CN202311166189.7A
Authority: CN
Inventors: 王玥洋; 佘堃; 刘书舟; 于钥
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2023-09-11
Filing date: 2023-09-11
Publication date: 2023-11-24

Abstract

The invention relates to the technical field of data clustering, and in particular to a density peak clustering method based on adaptive-scale grids and diffusion intensity. The method maps the original data points into a grid space T_g according to an adaptive-scale grid division, calculates the density of each grid, divides the grids into dense grids and sparse grids according to a grid density threshold, and regards the grid cells as the clustering objects in the subsequent clustering steps; divides the grids according to density fluctuation and calculates connectivity; calculates the relative distance and the diffusion intensity, screens hypothetical centers and divides the initial clusters; searches for the density peaks of inter-cluster edge grids; merges multiple clusters; and maps the grid clustering result to the original dataset. In this way, clustering of higher-dimensional data can be handled.

Description

Density peak clustering method for self-adaptive scale grid and diffusion intensity

Technical Field

The invention relates to the technical field of data aggregation, in particular to a density peak clustering method of self-adaptive scale grids and diffusion intensity.

Background

Cluster analysis is an unsupervised learning method that divides data into clusters according to specific criteria in order to explore the information implicit in the data. As a data analysis method, it is widely applied in data analysis, image processing, bioinformatics, pattern recognition, machine learning and other fields. At present, the density peak clustering (DPC) algorithm is used to cluster datasets, but the DPC algorithm cannot handle the clustering of higher-dimensional data.

Disclosure of Invention

The invention aims to provide a density peak clustering method based on adaptive-scale grids and diffusion intensity that is able to handle the clustering of higher-dimensional data.

In order to achieve the above purpose, the density peak clustering method of the self-adaptive scale grid and the diffusion intensity adopted by the invention comprises the following steps:

step 1, mapping the original data points into a grid space according to the adaptive-scale grid division, and dividing the grids into dense grids and sparse grids according to a grid density threshold: the data points are mapped into the grid space T_g, the density of each grid is calculated, and the grid cells are regarded as the clustering objects in the subsequent clustering steps;

step 2, dividing grids according to density fluctuation and calculating connectivity;

step 3, calculating the relative distance and the diffusion intensity, screening the assumed center and dividing the initial cluster;

step 4, searching density peaks of inter-cluster edge grids;

step 5, multi-cluster merging;

step 6, mapping the grid clustering result to the original dataset.

Wherein, in step 2, in the step of dividing the grids according to the density fluctuation and calculating the connectivity:

grids with ρ_g < θ are recorded as sparse grids and all other grids as dense grids, and the connectivity h_g of each dense grid and the global number of connected components ω are calculated.

Wherein, in step 3, the relative distance and the diffusion intensity are calculated, and in the step of screening the assumed center and dividing the initial clusters:

the relative distance δ_i of grid ob_i is calculated using the Chebyshev distance in d-dimensional space, where ob_ip is the p-th coordinate of the grid:
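The formula itself is not reproduced in this text. As a minimal sketch, assuming the usual DPC convention that the relative distance of a grid is its distance to the nearest grid of higher density, and that the densest grid receives its largest distance to any other grid, the Chebyshev-based computation could look as follows. The function names and the example densities are hypothetical.

```python
def chebyshev(a, b):
    """Chebyshev (L-infinity) distance: max over dimensions of |a_p - b_p|."""
    return max(abs(x - y) for x, y in zip(a, b))


def relative_distance(rho):
    """delta[g]: Chebyshev distance from cell g to the nearest denser cell.

    Assumption (standard DPC convention, not confirmed by the source):
    a cell with no denser cell gets its largest distance to any other cell.
    """
    cells = list(rho)
    delta = {}
    for g in cells:
        higher = [chebyshev(g, o) for o in cells if rho[o] > rho[g]]
        if higher:
            delta[g] = min(higher)
        else:
            delta[g] = max((chebyshev(g, o) for o in cells if o != g), default=0)
    return delta


# Hypothetical grid densities keyed by integer grid coordinates.
rho = {(0, 0): 9, (0, 1): 5, (4, 3): 7, (5, 3): 2}
delta = relative_distance(rho)
```

Because the Chebyshev distance takes a maximum over coordinates rather than a sum of squares, it costs one pass over the d coordinates of each pair of grids.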

after the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if, within a radius δ_i around grid i, γ_i > γ_j for every j ≠ i, the grid is a hypothetical center; each non-central grid is then assigned to the cluster represented by a center by following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids are assigned.
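The screening and allocation rule above leaves some details open, so the following is only one possible reading. It assumes that the radius δ_i is inclusive, and that a non-central grid takes the label of its nearest already-labeled grid, with distance ties broken in favor of the larger γ. All names and example values are hypothetical.

```python
def chebyshev(a, b):
    """Chebyshev distance between two grid coordinates."""
    return max(abs(x - y) for x, y in zip(a, b))


def screen_centers(gamma, delta):
    """Cells whose gamma exceeds that of every other cell within delta[i].

    Assumption: the radius is inclusive (distance <= delta[i]).
    """
    order = sorted(gamma, key=gamma.get, reverse=True)
    centers = []
    for i in order:
        rivals = [j for j in gamma if j != i and chebyshev(i, j) <= delta[i]]
        if all(gamma[i] > gamma[j] for j in rivals):
            centers.append(i)
    return centers


def assign_clusters(gamma, centers):
    """Assumed allocation rule: walk cells by descending gamma and copy the
    label of the nearest labeled cell (ties go to the larger gamma)."""
    labels = {c: k for k, c in enumerate(centers)}
    for i in sorted(gamma, key=gamma.get, reverse=True):
        if i in labels:
            continue
        nearest = min(labels, key=lambda j: (chebyshev(i, j), -gamma[j]))
        labels[i] = labels[nearest]
    return labels


# Hypothetical centrality and relative-distance values for four cells.
gamma = {(0, 0): 54, (1, 0): 8, (5, 0): 28, (6, 0): 2}
delta = {(0, 0): 6, (1, 0): 1, (5, 0): 4, (6, 0): 1}
centers = screen_centers(gamma, delta)
labels = assign_clusters(gamma, centers)
```

In this example the cells (0, 0) and (5, 0) are selected as hypothetical centers, and each remaining cell joins the cluster of its adjacent center.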

In step 4, in the step of searching the density peak value of the inter-cluster edge grid, the edge grid among the clusters is searched according to the following conditions:

grid i is in cluster c, and within its δ_i range there is a grid j belonging to another cluster c′;

i is the grid in cluster c that is nearest to grid j;

the edge grid density peak between a pair of clusters c and c′ is denoted ρ_cc′.
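The conditions above can be sketched as a brute-force search. This sketch assumes that ρ_cc′ is the highest density among the edge grids of cluster c that face cluster c′, which the text suggests but does not state explicitly. The Chebyshev distance, the function names and the example values are likewise assumptions.

```python
def chebyshev(a, b):
    """Chebyshev distance between two grid coordinates."""
    return max(abs(x - y) for x, y in zip(a, b))


def edge_density_peaks(rho, delta, labels):
    """Return {(c, c2): highest density among edge cells of c facing c2}."""
    peaks = {}
    for i, c in labels.items():
        for j, c2 in labels.items():
            # Condition 1: j belongs to another cluster and lies within delta[i].
            if c2 == c or chebyshev(i, j) > delta[i]:
                continue
            # Condition 2: i is the cell of cluster c nearest to j.
            nearest = min(chebyshev(k, j) for k, ck in labels.items() if ck == c)
            if chebyshev(i, j) == nearest:
                peaks[(c, c2)] = max(peaks.get((c, c2), 0), rho[i])
    return peaks


# Hypothetical example: two clusters "A" and "B" lying side by side.
rho = {(0, 0): 9, (1, 0): 4, (2, 0): 3, (3, 0): 8}
delta = {(0, 0): 3, (1, 0): 1, (2, 0): 1, (3, 0): 3}
labels = {(0, 0): "A", (1, 0): "A", (2, 0): "B", (3, 0): "B"}
peaks = edge_density_peaks(rho, delta, labels)
```

The triple loop is kept for readability only; a practical implementation would restrict the search to neighboring grids.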

Wherein, in step 5, in the step of multi-cluster merging:

whether clusters need to be merged is judged from the density peaks of the inter-cluster edge grids: a higher edge-grid density peak indicates a higher similarity between the two clusters; if the density peak of the cluster c in which the grid lies is within a certain density fluctuation range, the cluster should be considered for merging into the adjacent cluster c′, that is, when the following formula is satisfied:

This confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into the nearest cluster c′.

Wherein, in step 6, the step of mapping the grid clustering result to the original dataset:

each data point is labeled with the cluster to which its corresponding grid cell belongs, a lookup table between the data and the grid space is established, and the clusters in T_g are recorded in the data clustering result table.
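A minimal sketch of this lookup, assuming each point's grid coordinate was stored when the data were mapped into the grid space. The `noise` label for points whose cell has no cluster (for example a sparse cell) is a hypothetical choice; the text does not say how such points are handled.

```python
def label_points(point_cells, grid_labels, noise=-1):
    """Return one cluster label per data point via its grid cell.

    point_cells: grid coordinate of each data point, in data order.
    grid_labels: lookup table from grid coordinate to cluster label.
    Cells missing from the table get the hypothetical `noise` label.
    """
    return [grid_labels.get(cell, noise) for cell in point_cells]


# Hypothetical lookup table and per-point grid coordinates.
point_cells = [(0, 0), (0, 0), (1, 1), (3, 3)]
grid_labels = {(0, 0): 0, (1, 1): 1}
labels = label_points(point_cells, grid_labels)
```

Because every point in a cell shares the label of that cell, this step is a single dictionary lookup per data point.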

In the density peak clustering method based on adaptive-scale grids and diffusion intensity of the invention, the original data points are mapped into the grid space T_g according to the adaptive-scale grid division, the density of each grid is calculated, the grids are divided into dense grids and sparse grids according to the grid density threshold, and the grid cells are regarded as the clustering objects in the subsequent clustering steps; the grids are divided according to density fluctuation and connectivity is calculated; the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and the initial clusters are divided; the density peaks of inter-cluster edge grids are searched; multiple clusters are merged; and the grid clustering result is mapped to the original dataset. The method maps the dataset into a grid space using adaptive-scale grid division; defines an adaptive division scale and a density fluctuation formula, and uses the density of a single grid cell in place of the local density of the data points it contains; defines the diffusion intensity used in calculating grid centrality; and designs a central-grid screening scheme and an allocation strategy to obtain the clustering result. Compared with the original DPC algorithm, the method can handle the clustering of higher-dimensional data thanks to the grid division and the new cluster allocation strategy, and parameters such as the cutoff distance need not be preset; the time complexity is reduced and the clustering speed is improved; and the high-dimensional failure of the Euclidean distance used in the original algorithm is avoided.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the steps of the density peak clustering method based on adaptive-scale grids and diffusion intensity of the present invention.

Detailed Description

Referring to fig. 1, the invention provides a density peak clustering method of self-adaptive scale grids and diffusion intensity, which comprises the following steps:

step 1: mapping the original data points into a grid space according to the adaptive-scale grid division, and dividing the grids into dense grids and sparse grids according to a grid density threshold: the data points are mapped into the grid space T_g, the density of each grid is calculated, and the grid cells are regarded as the clustering objects in the subsequent clustering steps;

step 2: dividing grids according to density fluctuation and calculating connectivity;

step 3: calculating the relative distance and the diffusion intensity, screening the assumed center and dividing the initial cluster;

step 4: searching density peaks of inter-cluster edge grids;

step 5: combining multiple clusters;

step 6: mapping the grid clustering result to the original dataset.

In the present embodiment, the original data points are mapped into the grid space T_g according to the adaptive-scale grid division, the density of each grid is calculated, the grids are divided into dense grids and sparse grids according to the grid density threshold, and the grid cells are regarded as the clustering objects in the subsequent clustering steps; the grids are divided according to density fluctuation and connectivity is calculated; the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and the initial clusters are divided; the density peaks of inter-cluster edge grids are searched; multiple clusters are merged; and the grid clustering result is mapped to the original dataset. The method maps the dataset into a grid space using adaptive-scale grid division; defines an adaptive division scale and a density fluctuation formula, and uses the density of a single grid cell in place of the local density of the data points it contains; defines the diffusion intensity used in calculating grid centrality; and designs a central-grid screening scheme and an allocation strategy to obtain the clustering result. Compared with the original DPC algorithm, the method can handle the clustering of higher-dimensional data thanks to the grid division and the new cluster allocation strategy, and parameters such as the cutoff distance need not be preset; the time complexity is reduced and the clustering speed is improved; and the high-dimensional failure of the Euclidean distance used in the original algorithm is avoided.

Further, in step 2, in the step of dividing the mesh according to the density fluctuation and calculating the connectivity:

Further, in step 3, the relative distance and the diffusion intensity are calculated, and in the step of screening the assumed center and dividing the initial cluster:

after the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if, within a radius δ_i around grid i, γ_i > γ_j for every j ≠ i, the grid is a hypothetical center; each non-central grid is then assigned to the cluster represented by a center by following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids are assigned.

Further, in step 4, in the step of searching for the density peak value of the inter-cluster edge grid, the edge grid between clusters is searched according to the following conditions:

i is the grid in cluster c that is nearest to grid j;

Further, in step 5, in the step of multi-cluster merging:

Further, in step 6, in the step of mapping the mesh clustering result to the original dataset:

In the present embodiment, the following definitions are set:

Adaptive-scale grid: the dataset is treated as a d-dimensional feature space T = A_1 × A_2 × ... × A_d, and n is the number of samples in the dataset. Each dimension A_i (i ∈ {1, 2, ..., d}) of the feature space is divided according to an adaptive scale m, which turns the feature space into the grid space T_g. The formula for m is as follows:

Grid density: the number of data points in each grid cell is denoted ρ_g. The data points contained in a grid cell ob_i form a set; all data points in the set have the same local density, which is the density of the grid cell, ρ_i = ρ_gi;
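As a minimal sketch of the grid mapping and density count, the code below takes the scale m as an input, because the formula for m is not reproduced in this text. Equal-width cells between the per-dimension minimum and maximum are an assumption, and the function names are hypothetical.

```python
from collections import Counter


def map_to_grid(points, m):
    """Map d-dimensional points to integer grid coordinates.

    Assumption: each dimension is cut into m equal-width cells between its
    minimum and maximum. The source's formula for m is not reproduced here,
    so m is passed in by the caller.
    """
    d = len(points[0])
    lows = [min(p[k] for p in points) for k in range(d)]
    highs = [max(p[k] for p in points) for k in range(d)]
    cells = []
    for p in points:
        cell = []
        for k in range(d):
            span = highs[k] - lows[k]
            # A degenerate dimension puts every point in cell 0.
            idx = 0 if span == 0 else int((p[k] - lows[k]) / span * m)
            cell.append(min(idx, m - 1))  # keep the maximum in the last cell
        cells.append(tuple(cell))
    return cells


def grid_density(cells):
    """rho_g: number of data points in each occupied grid cell."""
    return Counter(cells)


points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 1.0), (0.95, 0.9)]
cells = map_to_grid(points, m=2)
rho = grid_density(cells)
```

Only occupied cells are stored, so memory grows with the number of non-empty cells rather than with m to the power d.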

Global density fluctuation value: the global density fluctuation is calculated from the density-peak grid and the minimum-density grid, with the formula:

θ = ε(ρ_max - ρ_min),

where the empirical parameter ε is typically set to 1/4;
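For illustration, the threshold and the resulting dense/sparse split can be written as below. The sketch assumes that ρ_max and ρ_min are taken over occupied grid cells only, which the text does not specify. The function names are hypothetical.

```python
def density_threshold(rho, eps=0.25):
    """theta = eps * (rho_max - rho_min), taken over occupied grid cells."""
    values = list(rho.values())
    return eps * (max(values) - min(values))


def split_dense_sparse(rho, eps=0.25):
    """Cells with rho_g < theta are sparse; all other cells are dense."""
    theta = density_threshold(rho, eps)
    dense = {g for g, r in rho.items() if r >= theta}
    sparse = {g for g, r in rho.items() if r < theta}
    return theta, dense, sparse


# Hypothetical grid densities keyed by grid coordinates.
rho = {(0, 0): 9, (0, 1): 5, (3, 3): 1, (4, 4): 2}
theta, dense, sparse = split_dense_sparse(rho)
```

With ε = 1/4 and densities ranging from 1 to 9, θ equals 2, so only the cell holding a single point is classified as sparse.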

Grid connectivity h_g: the number of grids that a grid can reach through adjacent grids, that is, the size of the connected component in which the grid lies;
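One way to obtain h_g and the number of connected components ω is a breadth-first search over the dense cells. The sketch assumes that two cells are adjacent when every coordinate differs by at most 1; the text does not define adjacency. The function name is hypothetical.

```python
from collections import deque
from itertools import product


def connectivity(dense):
    """Return (h, omega) for a set of dense grid coordinates.

    h[g] is the size of the connected component containing cell g and
    omega is the number of connected components.
    Assumption: cells are adjacent when every coordinate differs by <= 1.
    """
    dense = set(dense)
    d = len(next(iter(dense))) if dense else 0
    offsets = [o for o in product((-1, 0, 1), repeat=d) if any(o)]
    h, omega = {}, 0
    for start in dense:
        if start in h:
            continue  # already part of a finished component
        omega += 1
        component, queue, seen = [start], deque([start]), {start}
        while queue:
            cell = queue.popleft()
            for o in offsets:
                nb = tuple(c + s for c, s in zip(cell, o))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    component.append(nb)
                    queue.append(nb)
        for cell in component:
            h[cell] = len(component)
    return h, omega


h, omega = connectivity({(0, 0), (0, 1), (1, 1), (5, 5)})
```

Enumerating all 3^d - 1 neighbor offsets becomes expensive as d grows; for high-dimensional data, comparing pairs of occupied cells directly may be preferable.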

Diffusion intensity: the diffusibility of a grid is related to the size h_g of the connected component in which the grid lies and to the total number of connected components ω; the diffusion intensity of each grid is calculated as follows:

Centrality: the weight of a grid as a cluster center, denoted γ; the larger the γ value, the more likely the grid is a cluster center. The centrality of each grid is calculated based on two assumptions:

1) A density-peak grid should have a higher density than its surrounding grids and, at the same time, be relatively far from the nearest grid of higher density;

2) A grid lying in a connected component with a relatively large connectivity is unlikely to become a central grid even if its density is high, which covers the case where several density-peak grids exist in the high-density area of a cluster;

The formula for γ is obtained from these assumptions:

γ_i = ρ_i δ_i - η_i.
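As a small worked example of the γ formula, the sketch below treats η_i as a given per-grid value. It presumably denotes the diffusion intensity, whose formula is not reproduced in this text, so the η values here are placeholders and the function names are hypothetical.

```python
def centrality(rho, delta, eta):
    """gamma_i = rho_i * delta_i - eta_i for every grid cell."""
    return {g: rho[g] * delta[g] - eta[g] for g in rho}


def rank_by_centrality(gamma):
    """Grid cells sorted by gamma, largest first."""
    return sorted(gamma, key=gamma.get, reverse=True)


rho = {(0, 0): 9, (0, 1): 5, (4, 3): 7}
delta = {(0, 0): 4, (0, 1): 1, (4, 3): 4}
eta = {(0, 0): 2.0, (0, 1): 2.0, (4, 3): 1.0}  # placeholder values for eta_i
gamma = centrality(rho, delta, eta)
order = rank_by_centrality(gamma)
```

The cell (0, 1) is dense but adjacent to an even denser cell, so its small δ keeps γ low, which matches the first assumption above.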

Based on the above definitions:

Step 1: the original data points are mapped into a grid space according to the adaptive-scale grid division, and the grids are divided into dense grids and sparse grids according to the grid density threshold; the data points are mapped into the grid space T_g, the density of each grid is calculated, and the grid cells are regarded as the clustering objects in the subsequent clustering steps.

Step 2: the grids are divided according to density fluctuation and connectivity is calculated; grids with ρ_g < θ are recorded as sparse grids and all other grids as dense grids, and the connectivity h_g of each dense grid and the global number of connected components ω are calculated.

Step 3: the relative distance and the diffusion intensity are calculated, hypothetical centers are screened and the initial clusters are divided; the relative distance δ_i of grid ob_i is calculated using the Chebyshev distance in d-dimensional space, where ob_ip is the p-th coordinate of the grid:

After the centrality of each grid is calculated, the grids are sorted by centrality in descending order and traversed; if, within a radius δ_i around grid i, γ_i > γ_j for every j ≠ i, the grid is a hypothetical center; each non-central grid is then assigned to the cluster represented by a center by following its nearest-neighbor grid with the highest γ value, and this step is repeated until all remaining grids are assigned.

Step 4: the density peaks of the inter-cluster edge grids are searched; an edge grid satisfies the following conditions: grid i is in cluster c, and within its δ_i range there is a grid j belonging to another cluster c′; i is the grid in cluster c that is nearest to grid j; the edge grid density peak between a pair of clusters c and c′ is denoted ρ_cc′.

Step 5: multiple clusters are merged; whether clusters need to be merged is judged from the density peaks of the inter-cluster edge grids: a higher edge-grid density peak indicates a higher similarity between the two clusters; if the density peak of the cluster c in which the grid lies is within a certain density fluctuation range, the cluster should be considered for merging into the adjacent cluster c′, that is, when the following formula is satisfied:

This confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into the nearest cluster c′.

Step 6: the grid clustering result is mapped to the original dataset; each data point is labeled with the cluster to which its corresponding grid cell belongs, a lookup table between the data and the grid space is established, and the clusters in T_g are recorded in the data clustering result table.

The method maps the dataset into a grid space using adaptive-scale grid division; defines an adaptive division scale and a density fluctuation formula, and uses the density of a single grid cell in place of the local density of the data points it contains; defines the diffusion intensity used in calculating grid centrality; and designs a central-grid screening scheme and an allocation strategy to obtain the clustering result. Compared with the original DPC algorithm, the method can handle the clustering of higher-dimensional data thanks to the grid division and the new cluster allocation strategy, and parameters such as the cutoff distance need not be preset; the time complexity is reduced and the clustering speed is improved; and the high-dimensional failure of the Euclidean distance used in the original algorithm is avoided.

The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims

1. The density peak clustering method for the self-adaptive scale grid and the diffusion intensity is characterized by comprising the following steps of:

step 4, searching density peaks of inter-cluster edge grids;

step 5, multi-cluster merging;

step 6, mapping the grid clustering result to the original dataset.

2. The method for clustering density peaks of adaptive scale grids and diffusion intensities according to claim 1, wherein in step 2, the step of meshing and calculating connectivity according to density fluctuations:

3. The method for clustering density peaks of adaptive scale grids and diffusion intensities according to claim 1, wherein in step 3, the relative distance and diffusion intensity are calculated, and the step of screening the hypothetical center and dividing the initial clusters:

4. The method for clustering density peaks of adaptive scale grids and diffusion intensities according to claim 1, wherein in the step of searching for density peaks of inter-cluster edge grids in step 4, the edge grids between clusters are searched for according to the following condition:

i is the grid in cluster c that is nearest to grid j;

5. The method of clustering density peaks for adaptive scale grids and diffusion intensities according to claim 1, wherein in step 5, the step of multi-cluster merging:

This confidence judgment is made for all clusters in descending order of ρ_cc′; clusters judged not to be independent clusters are marked and merged into the nearest cluster c′.

6. The method of density peak clustering of adaptive scale grids and diffusion intensities of claim 1, wherein in step 6, the grid clustering result is mapped to the original dataset: