CN113780437B - Improved method of DPC clustering algorithm - Google Patents
- Publication number: CN113780437B (application CN202111080561.3A)
- Authority: CN (China)
- Prior art keywords: center, cluster, clustering, density, data points
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention provides an improved method of the DPC clustering algorithm, comprising the following steps: S1, selecting initial clustering centers through the mean distance and the cutoff center; S2, clustering all data points by their Euclidean distance to each initial clustering center using the K-Means assignment strategy; S3, updating the cluster centers, performing center shifts, reassigning memberships to all data points, and repeating; S4, judging whether center fusion is needed between clusters; if center fusion is needed, performing it following the idea of an iterative fusion method to obtain a new clustering result; if not, keeping the final clustering result of S3. The invention provides a new clustering idea: a method that finds initial clustering centers based on the maximum mean distance and fuses clusters based on high-density connection, applying the idea of iterative fusion for center fusion to obtain a better clustering result.
Description
Technical Field
The invention relates to the technical field of data analysis and mining, in particular to an improved method of the DPC (Density Peaks Clustering) algorithm.
Background
Cluster analysis is an active research field that involves many subjects, such as data mining, pattern recognition, machine learning, and data analysis. With the development of technology, the era of big data has arrived, and the information contained in data is highly valuable. Cluster analysis aims to divide objects into groups using only the data that describes the objects and their relationships, so that objects within a group are similar to one another and objects in different groups are dissimilar.
A clustering is essentially a set of clusters that together usually contain all the objects in the data set (some algorithms recognize noise, and noise points are generally considered to belong to a noise cluster). A clustering may also specify the relationship of clusters to each other, for example a hierarchy of clusters embedded in one another. The better-known clustering methods, listed by cluster model, are: 1) connectivity-based cluster models, corresponding to hierarchical clustering: hierarchical clustering algorithms have high time complexity. 2) Centroid-based cluster models, corresponding to partitional clustering: such algorithms generally suffer from the difficulty of determining the number of clusters k and of selecting the initial center points. 3) Distribution-based cluster models, corresponding to model-based clustering: although the theoretical basis of these methods is very good, such algorithms are often prone to overfitting. 4) Grid-based cluster models, corresponding to grid clustering, which involve the following disadvantages: (1) poor clustering effect; (2) low accuracy. 5) Density-based cluster models, corresponding to density clustering, which still have some problems to be solved: (1) the complexity O(n²) is high, making them unsuitable for cluster analysis of large-scale data; (2) the process is not adaptive and intrinsic parameters cannot be adjusted automatically; for example, the density peaks and d_c cannot be selected adaptively; (3) the accuracy is easily affected: when DPC computes the local density, ignoring the local structure of the data can cause clusters to be lost, i.e., the "false peak" and "no peak" phenomena, which harms clustering accuracy; (4) applicability to high-dimensional data is poor, because many dimensions of high-dimensional data are independent of one another, which can cause some clusters to be lost.
Disclosure of Invention
The present invention provides an improved method of DPC clustering algorithms to overcome the above-described problems.
In order to achieve the above object, the present invention provides the following technical solutions:
s1, selecting an initial clustering center through the mean value distance and the cut-off center;
s2, clustering according to Euclidean distances from all data points to each initial clustering center by adopting a K-Means allocation strategy;
s3, updating the cluster center, performing center offset, reassigning attributions to all data points, and repeating the operation; when the Euclidean distance between two points of the new cluster center and the old cluster center is smaller than a set value, stopping updating the cluster center, and taking the last clustering result as a final clustering result;
s4, judging whether center fusion is needed or not between clusters obtained after updating the cluster center; if center fusion is needed, adopting the thought of an iterative fusion method to perform center fusion, and obtaining a new clustering result; if not, the final clustering result in S3 is adopted.
Further, continuously updating the cluster center includes:
adopting the K-Means cluster-center update strategy: computing the mean of all data points in each cluster as the new cluster center of that cluster.
Further, reassigning memberships to all data points as described in S3 includes:
reassigning each data point according to the new cluster centers: using the K-Means assignment strategy, computing the distance from every data point to each new cluster center and re-clustering the data points by that distance.
Further, the center fusion in S4 includes:
s511, traversing all cluster centers and judging pairwise whether two clusters should be fused; when a pair of clusters to be fused is found, stopping the judgment and returning the two cluster centers to be fused;
s512, computing the density of the two cluster centers, taking the label of the denser of the two centers as the fused cluster label, and returning the label assignment after fusion;
s513, re-binding the data set with the labels and marking it; computing the mean center of the points sharing the fused cluster label as the fused cluster center;
s514, returning the new set of cluster centers after fusion;
s515, iterating the pairwise center fusion with the newly obtained cluster centers until no further fusion is possible; center fusion then ends.
Further, determining whether center fusion is required between clusters includes:
s521, taking the straight line distance between two cluster centers as the diameter, taking the midpoint of the straight line distance between the two cluster centers as the center of a circle, and finding out the data points respectively belonging to the two clusters in the circle;
s522, finding out paired pseudo core data points, which specifically comprises the following steps: calculating the distance between data points respectively belonging to two clusters in a circle, and finding out paired points with the distance between the two data points smaller than the truncated radius, namely paired pseudo-core data points; if no paired pseudo core data points exist, the two clusters cannot be fused;
the cutoff radius d_c is calculated as:

d_c = maxDist × distPercent / 100 (1)

wherein maxDist is the maximum value in the distance vector distList; distPercent is a cutoff percentage, whose value is adjusted according to the characteristics of each data set;
s523, finding out paired true core data points, which specifically comprises the following steps: finding two paired data points with the density being greater than the minimum density from paired pseudo-core data points, namely paired true-core data points; if no pair of true core data points exist, the two clusters cannot be fused; the minimum density is a manually set value;
s524, judging whether the paired true core data points and the two cluster centers are high-density connected.
High-density connection includes:
dividing the distance from each true core data point to the cluster center on its side by the cutoff radius to obtain the high-density threshold for that side; for each side, taking the straight-line distance between the true core data point and the cluster center on the same side as a diameter and its midpoint as the circle center, drawing a circle and counting the high-density data points inside it; if the number of high-density points is at least the side's high-density threshold, and the maximum local density in the circle is no more than twice the minimum local density, that side is high-density connected; otherwise it is not;
when both sides are high-density connected, the two clusters can be fused into one cluster; if either side (or both) fails the high-density connection test, the two clusters cannot be fused.
further, S1, selecting an initial clustering center by using the mean distance and the selected point range of the reduced initial clustering center, including:
s11, narrowing the point selection range of the initial clustering center: calculating the density of all data points, setting a minimum density value, and selecting all data points larger than the minimum density value;
s12, selecting a 1 st clustering center: finding the data point with the maximum density as the 1 st clustering center;
s13, selecting a 2 nd clustering center: excluding data points in the truncated radius of the 1 st clustering center, selecting the data point farthest from the 1 st clustering center from the rest data points, judging whether the distance between the data point and the 1 st clustering center is more than twice the truncated radius, and selecting the data point as the 2 nd clustering center if the distance between the data point and the 1 st clustering center is more than twice the truncated radius; otherwise, the cluster is not selected as the 2 nd cluster center, and the initial cluster center is selected;
s14, selecting a 3 rd clustering center: after selecting the 2 nd clustering center, excluding the data points in the 1 st clustering center cutoff radius and the data points in the 2 nd clustering center cutoff radius, finding the data point with the maximum average value of the sum of the distances from the 1 st clustering center and the 2 nd clustering center from the rest data points, judging whether the distances from the data point to the 1 st clustering center and the 2 nd clustering center are both larger than the twice cutoff radius, and if both the distances are larger than the twice cutoff radius, selecting the data point as the 3 rd clustering center; otherwise, the cluster is not selected as the 3 rd cluster center, and the initial cluster center is selected;
and repeating the step S14 until the distance from the new mean value center to the cluster center selected already is smaller than twice the cut-off radius, and stopping selecting the initial cluster center.
Further, the local density is calculated as:

ρ_i = Σ_{j≠i} χ(d_ij − d_c) (2)

wherein ρ_i represents the local density of data point x_i, n represents the number of data points, d_ij = dist(x_i, x_j) represents the distance between data points x_i and x_j, d_c represents the cutoff radius, and χ is an indicator function with χ(x) = 1 when x < 0 and χ(x) = 0 otherwise.
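For illustration, the cutoff radius of formula (1) and the cutoff-kernel density of formula (2) can be computed together. The following is a minimal Python sketch; the function name and the NumPy-based layout are our own choices, not part of the patent:

```python
import numpy as np

def local_density(X, dist_percent):
    """Cutoff-kernel local density of formula (2). X is an (n, d) array;
    dist_percent is distPercent from formula (1)."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances d_ij
    d_c = d.max() * dist_percent / 100.0    # formula (1): maxDist * distPercent / 100
    rho = (d < d_c).sum(axis=1) - 1         # points within d_c, excluding the point itself
    return rho, d, d_c
```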
The invention provides a new clustering idea, the MM-HDC (max mean and high density connection) method, which finds initial clustering centers based on the maximum mean distance and fuses clusters based on high-density connection. First, initial clustering centers are selected using the mean distance and the cutoff center; then the K-Means assignment strategy clusters all data points by their distance to each initial cluster center; the cluster centers are then updated continuously, with center shifts, until the position change between the new and old cluster centers is small, at which point updating stops and the last clustering is taken as the final clustering result. Finally, center fusion following the idea of an iterative fusion method yields a better clustering result.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a visualization of the 1st clustering result of the invention;
FIG. 3 is a visualization of the 7th clustering result of the invention;
FIG. 4 is a visualization of the 3rd center fusion of the invention;
FIG. 5 is a visualization of the high-density connection of the invention;
FIG. 6 is the 3rd clustering result on the Iris dataset of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, an improved method of DPC clustering algorithm includes:
s1, selecting an initial clustering center through the mean value distance and the cut-off center;
s2, clustering according to Euclidean distances from all data points to each initial clustering center by adopting a K-Means allocation strategy;
s3, updating the cluster center, performing center offset, reassigning attributions to all data points, and repeating the operation; when the Euclidean distance between two points of the new cluster center and the old cluster center is smaller than 1, stopping updating the cluster center, and taking the last clustering result as a final clustering result;
s4, judging whether center fusion is needed between clusters; if center fusion is needed, adopting the thought of an iterative fusion method to perform center fusion, and obtaining a new clustering result; if not, the final clustering result in S3 is adopted.
Preferably, updating the cluster center includes:
adopting the K-Means cluster-center update strategy: computing the mean of all points in each cluster as the new cluster center of that cluster.
Preferably, reassigning memberships to all data points includes:
reassigning each data point according to the new cluster centers: using the K-Means assignment strategy, computing the distance from every data point to each new cluster center and re-clustering the data points by that distance.
Preferably, the center fusion comprises:
s511, traversing all cluster centers and judging pairwise whether clusters should be fused; when a pair of clusters to be fused is found, stopping the search and returning the indexes of the two cluster centers to be fused;
s512, computing the density of the two cluster centers, taking the label of the denser center as the fused cluster label, and returning the label assignment after fusion;
s513, reordering the data set labels, including: re-binding the data set with the labels and marking it; computing the mean center of the points sharing the fused cluster label as the fused cluster center;
s514, visualizing the fusion of the two clusters with Matplotlib, and returning the new set of cluster centers after fusion;
s515, iterating the pairwise center fusion with the newly obtained cluster centers until no further fusion is possible; center fusion then ends.
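The S511-S515 loop can be sketched as below. can_merge(i, j) and merge_pair(i, j) are hypothetical caller-supplied hooks, standing in for the fusion tests of S521-S524 and the label/center updates of S512-S513 described in this patent:

```python
def iterative_fusion(centers, labels, can_merge, merge_pair):
    """S511-S515 as a loop. can_merge(i, j) and merge_pair(i, j) are
    caller-supplied hooks (hypothetical names) implementing the fusion
    tests of S521-S524 and the label/center updates of S512-S513."""
    merged = True
    while merged:                          # S515: repeat until stable
        merged = False
        for i in range(len(centers)):      # S511: pairwise traversal
            for j in range(i + 1, len(centers)):
                if can_merge(i, j):
                    # S512-S514: the denser center's label wins; the fused
                    # center is the mean of the merged clusters' points
                    centers, labels = merge_pair(i, j)
                    merged = True
                    break
            if merged:
                break
    return centers, labels
```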
Preferably, determining whether center fusion is required between clusters includes:
s521, taking the straight-line distance between the two cluster centers as a diameter and its midpoint as the circle center, find the points inside the circle that belong to each of the two clusters.
Specifically, denote the centers of the two clusters to be fused as a and b, and the distance between them as d_ab. Draw a circle with d_ab as diameter: the straight-line distance d_ab between the two cluster centers is the diameter, and the midpoint o of segment ab is the circle center. Within the circle, find the point set A of points belonging to cluster a and the point set B of points belonging to cluster b.
S522, finding paired pseudo core points: computing the pairwise distances between the in-circle points belonging to the two clusters, and finding the pairs whose distance is smaller than the cutoff radius, i.e., the paired pseudo core points; if no paired pseudo core points exist, the two clusters cannot be fused;
preferably, the cutoff radius d_c is calculated as:

d_c = maxDist × distPercent / 100

wherein maxDist is the maximum value in the distance vector distList; distPercent is a cutoff percentage, whose value is adjusted according to the characteristics of each data set;
specifically, find the paired pseudo core points: pairs with d_AB < d_c, where d_AB denotes the Euclidean distance from a point in set A to a point in set B, and d_c denotes the cutoff distance chosen when computing the density of each point in the data set (the density of a point here is the number of points within radius d_c of it; this cutoff distance is the cutoff distance parameter of the DPC algorithm). If some distance d_AB is smaller than d_c, the corresponding point pairs are called paired pseudo core points (A', B'), where the points belonging to set A are denoted A' and those belonging to set B are denoted B'.
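A minimal sketch of this pseudo-core search, assuming d is the precomputed pairwise distance matrix (e.g., a NumPy array) and A_idx, B_idx are the indices of the in-circle points of the two clusters; the function name is ours:

```python
def paired_pseudo_cores(A_idx, B_idx, d, d_c):
    """Paired pseudo core points (A', B'): pairs (i, j) with i in set A,
    j in set B, and d[i, j] < d_c. d is the pairwise distance matrix."""
    pairs = [(i, j) for i in A_idx for j in B_idx if d[i, j] < d_c]
    return pairs   # an empty list means the two clusters cannot be fused
```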
S523, finding paired true core points: from the paired pseudo core points, finding pairs in which both points have density greater than the minimum density, i.e., the paired true core points; if no paired true core points exist, the two clusters cannot be fused; the minimum density is a manually set value;
specifically, find the true core points: from (A', B'), find paired points x ∈ A', y ∈ B' with ρ_i > ρ_min, where x and y are density-connected; ρ_i denotes the density of point i and ρ_min the minimum density value (if the top 70% of points in descending density order are taken as the selection range for initial cluster centers, ρ_min is the density of the last point inside that top 70%). The point pairs (x, y) ∈ (A', B') meeting these conditions are called true core points, i.e., density-connected points.
S524, judging whether the paired true core points and the two cluster centers are high-density connected:
dividing the distance from each true core point to the cluster center on its side by the cutoff radius gives the high-density threshold for that side; for each side, draw a circle whose diameter is the straight-line distance between the true core point and the cluster center on the same side, with its midpoint as the circle center, and count the high-density points inside; if the count is at least the side's high-density threshold, that side is high-density connected, otherwise it is not;
when both sides are high-density connected, the two clusters can be fused into one cluster; if either side (or both) fails, the two clusters cannot be fused.
Specifically, judging high-density connection includes:
(1) connect a and x; draw a circle with d_ax as diameter; the number of high-density points inside must satisfy p_ax ≥ d_ax/d_c;
(2) connect b and y; draw a circle with d_by as diameter; the number of high-density points inside must satisfy p_by ≥ d_by/d_c;
(3) p_ax and p_by must not differ much, i.e., the local density may not differ by more than a factor of two.
If cluster center a is high-density connected to true core point x, and cluster center b is high-density connected to true core point y, then cluster centers a and b are high-density connected, i.e., clusters a and b can be merged into one cluster. In (1) and (2) the circle centers are the midpoints of segments ax and by respectively; d_ax denotes the straight-line distance from a to x, and d_ax/d_c means that on average there must be at least one high-density point per distance d_c (likewise on the other side); a high-density point is a point with ρ_i > ρ_min.
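The one-sided test can be sketched as follows, assuming X holds the data points, rho their densities as a NumPy array, and idx the indices of candidate points; the function name and signature are illustrative, not from the patent:

```python
import numpy as np

def high_density_connected(c, t, X, rho, rho_min, d_c, idx):
    """One-sided high-density connection test between cluster center c and
    true core point t (conditions (1)-(3) above). c and t index into X;
    rho_min is the manually set minimum density."""
    d_ct = np.linalg.norm(X[c] - X[t])      # straight-line distance c-t
    mid = (X[c] + X[t]) / 2.0               # circle center: midpoint of ct
    radius = d_ct / 2.0                     # circle with d_ct as diameter
    inside = [i for i in idx if np.linalg.norm(X[i] - mid) <= radius]
    high = np.array([i for i in inside if rho[i] > rho_min])  # high-density points
    if high.size == 0 or high.size < d_ct / d_c:   # threshold d_ct / d_c
        return False
    # densities inside the circle may not differ by more than a factor of two
    return rho[high].max() <= 2 * rho[high].min()
```

Both sides must pass this test (a with x, and b with y) before the two clusters are merged.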
Preferably, S1 selects initial clustering centers using the mean distance and a narrowed selection range for the initial cluster centers, comprising the following steps:
s11, narrowing the selection range for initial cluster centers: compute the density of all data points, set a minimum density value, and keep all points whose density exceeds it;
s12, selecting the 1st cluster center: find the point of maximum density as the 1st cluster center;
s13, selecting the 2nd cluster center: excluding the points within the cutoff radius of the 1st cluster center, pick the remaining point farthest from the 1st cluster center and judge whether its distance to the 1st cluster center exceeds twice the cutoff radius; if so, select it as the 2nd cluster center; otherwise do not, whereupon initial cluster center selection is finished;
s14, selecting the 3rd cluster center: after selecting the 2nd cluster center, exclude the points within the cutoff radius of the 1st cluster center and of the 2nd cluster center; among the remaining points, find the one whose mean distance to the 1st and 2nd cluster centers is largest, and judge whether its distances to both centers exceed twice the cutoff radius; if both do, select it as the 3rd cluster center; otherwise do not, whereupon initial cluster center selection is finished;
and repeat this step until the distance from the new maximum-mean point to some already-selected cluster center is smaller than twice the cutoff radius; initial cluster center selection is then complete.
Specifically, S11: calculate the density ρ_i of all data points, sort them in descending order, and set Δρ = 70% as the selection range for cluster centers; Δρ denotes the cutoff percentage that restricts candidate cluster centers to the higher-density points after all points are sorted in descending density order, and Δρ has the same meaning in the subsequent steps.
S12, selecting the first cluster center: among the Δρ = 70% points, find the point of maximum density as the 1st cluster center;
s13, selecting the second cluster center: from the Δρ = 70% points, exclude those within radius d_c of the 1st cluster center; among the remaining Δρ = 70% points, pick the one farthest from the 1st cluster center and judge whether its distance to the 1st cluster center is > 2·d_c; if so, select it as the 2nd cluster center.
Specifically, the > 2·d_c requirement ensures that the core-point region of the new cluster center does not intersect that of an already-selected center; otherwise the two core regions ought to be merged into one, which would create fusion points. It also guarantees that the new center is not a core point of the first selected center and is relatively far from it. If the condition is not met, the point is not selected as the second cluster center and initial cluster center selection ends, i.e., S15;
s14, selecting the third cluster center: from the Δρ = 70% points, exclude those within radius d_c of the 1st cluster center and those within radius d_c of the 2nd cluster center; among the remaining Δρ = 70% points, find the one whose mean distance to the 1st and 2nd cluster centers is largest, and judge whether its distances to both centers are > 2·d_c; if both are, select it as the 3rd cluster center;
specifically, the > 2·d_c requirement again ensures that the new center's core-point region does not intersect that of any selected center (which would otherwise have to be merged into one core region, avoiding fusion points), and that the new center is not a core point of the first center and is relatively far from it. If the condition fails for any selected center, i.e., one distance exceeds 2·d_c and another does not, or neither does, the point is not selected as the third cluster center and initial cluster center selection ends, i.e., S15;
s15, repeat the above until the distance from the new maximum-mean point to some already-selected cluster center is < 2·d_c, ending initial cluster center selection.
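Steps S11-S15 admit a compact sketch. The function below assumes the densities rho and the pairwise distance matrix d were computed beforehand; its name and signature are our own:

```python
import numpy as np

def pick_initial_centers(X, rho, d, d_c, delta_rho=0.70):
    """Maximum-mean-distance initial center selection (S11-S15).
    Candidates are the top delta_rho (70%) of points by density."""
    order = np.argsort(-rho)                        # descending density
    cand = list(order[: int(len(X) * delta_rho)])   # S11: candidate range
    centers = [cand[0]]                             # S12: densest point
    while True:
        # exclude candidates within d_c of any already-selected center
        remaining = [i for i in cand
                     if all(d[i, c] > d_c for c in centers)]
        if not remaining:
            break
        # S13/S14: candidate with the largest mean distance to the centers
        means = [np.mean([d[i, c] for c in centers]) for i in remaining]
        nxt = remaining[int(np.argmax(means))]
        # a new center must be > 2*d_c from every selected center
        if all(d[nxt, c] > 2 * d_c for c in centers):
            centers.append(nxt)
        else:
            break                                   # S15: selection ends
    return centers
```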
Preferably, the local density is calculated as:

ρ_i = Σ_{j≠i} χ(d_ij − d_c) (2)

wherein n represents the number of data points, d_ij = dist(x_i, x_j) represents the distance between points x_i and x_j, d_c represents the cutoff radius, χ is the indicator function with χ(x) = 1 for x < 0 and χ(x) = 0 otherwise, and ρ_i represents the local density of data point x_i.
Preferably, the density distance of the current point is calculated as:

δ_i = max_j(d_ij) (3)

δ_i = min_{j: ρ_j > ρ_i}(d_ij) (4)

wherein d_ij = dist(x_i, x_j) represents the distance between points x_i and x_j, ρ_i the local density of data point x_i, ρ_j the local density of data point x_j, and δ_i the density distance of the current point x_i;
if the current point x_i has the maximum local density, δ_i is given by formula (3): the distance from x_i to the data point farthest from it in the data set; otherwise, δ_i is given by formula (4): the distance from x_i to the nearest data point whose local density exceeds that of x_i.
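A sketch of formulas (3) and (4), assuming rho is a NumPy array of densities and d the pairwise distance matrix; the function name is ours:

```python
import numpy as np

def density_distance(rho, d):
    """Density distance delta_i of formulas (3)/(4): for the densest
    point, the maximum distance to any other point; otherwise the
    minimum distance to a denser point."""
    n = len(rho)
    delta = np.zeros(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]    # points denser than x_i
        if denser.size == 0:                  # x_i has the maximum density
            delta[i] = d[i].max()             # formula (3)
        else:
            delta[i] = d[i, denser].min()     # formula (4)
    return delta
```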
Example 2
The implementation steps of the present invention are divided into the following two parts: the first part is the initial cluster and the second part is the center fusion, described below as detailed steps of the improved algorithm herein.
1. Initial clustering: maximum mean distance + K-Means
step1, selecting initial clustering center (see initial clustering center selection strategy)
step2, 1 st clustering
Assign memberships to all data points according to the selected initial cluster centers: using the K-Means assignment strategy, compute the distance from every data point to each initial cluster center and assign each point to the cluster of its nearest initial center;
step3, updating cluster center, and performing center offset
Adopting the K-Means cluster-center update strategy: compute the mean of all points in each cluster as the new cluster center of that cluster.
step4, reassigning memberships of all data points
Reassign memberships to all data points according to the new cluster centers: using the K-Means assignment strategy, compute the distance from every data point to each new cluster center and assign each point to the cluster of its nearest new center;
step5 repeat step3 and step4
step6, setting a stop condition to obtain a final clustering result
Iterate until the position change between the new and old cluster centers is small, i.e., their distance falls below a small predefined value; then stop updating the cluster centers and take the last clustering result as the final clustering result.
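Steps 2-6 together form a K-Means-style loop. A minimal sketch, assuming NumPy arrays and that no cluster empties out during iteration; the function name and the tol default are our own:

```python
import numpy as np

def cluster_until_stable(X, centers, tol=1.0):
    """step2-step6: K-Means-style assignment and center update, stopping
    once every center moves less than tol (the 'small value')."""
    while True:
        # step2/step4: assign each point to its nearest cluster center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # step3: move each center to the mean of its cluster's points
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        if shift.max() < tol:        # step6: stop condition reached
            return labels, centers
```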
2. Center fusion: iterative fusion method
Fusing two clusters involves fusing the cluster centers (the mean center of the two clusters' points, obtained as in K-Means) and fusing the cluster labels (keeping the label of the denser center). The specific steps are as follows:
step1, traverse all cluster centers and judge pairwise whether clusters should be fused; when a pair to fuse is found, stop searching and return the indexes of the two cluster centers to be fused
step2, compute the density of the two cluster centers (the number of points within radius d_c); take the label of the denser center as the fused cluster label and return the label assignment after fusion
step3, reorder the data set labels, and find the mean center of the points sharing the fused cluster label as the fused cluster center
step4, visualize the fusion of the two clusters, and return the new set of cluster centers after fusion
step5, iterate the pairwise center fusion with the newly obtained cluster centers until no further fusion is possible; center fusion then ends.
Principle for judging whether to fuse:
(1) Let the centers of the two clusters to be fused be a and b. Draw a circle with d_ab as diameter, and find the point sets A ∈ cluster a and B ∈ cluster b inside the circle.
(2) Find the paired pseudo core points: pairs with d_AB < d_c, denoted (A', B').
(3) Find the true core points: from (A', B'), find paired points x ∈ A', y ∈ B' with ρ_i > ρ_min, where x and y are density-connected.
(4) Judge whether ax and by are high-density connected:
1) draw a circle with d_ax as diameter; the number of high-density points inside must satisfy p_ax ≥ d_ax/d_c;
2) draw a circle with d_by as diameter; the number of high-density points inside must satisfy p_by ≥ d_by/d_c;
3) p_ax and p_by must not differ much, i.e., the local density may not differ by more than a factor of two.
If cluster center a is high-density connected to true core point x, and cluster center b is high-density connected to true core point y, then cluster centers a and b are high-density connected, i.e., clusters a and b can be merged into one cluster.
Note: d_ax denotes the straight-line distance from a to x, and d_ax/d_c means that on average there must be at least one high-density point per distance d_c; a high-density point is a point whose local density exceeds the minimum density.
Initial cluster center selection strategy
(1) Compute the density ρ_i of all data points, sort them in descending order, and set Δρ = 70% as the cluster-center selection range;
(2) first cluster center: among the Δρ = 70% points, find the point of maximum density as the 1st cluster center;
(3) second cluster center: from the Δρ = 70% points, exclude those within radius d_c of the 1st cluster center; among the remaining Δρ = 70% points, pick the one farthest from the 1st cluster center and judge whether its distance to the 1st cluster center is > 2·d_c; if so, select it as the 2nd cluster center;
(4) third cluster center: from the Δρ = 70% points, exclude those within radius d_c of the 1st cluster center and those within radius d_c of the 2nd cluster center; among the remaining Δρ = 70% points, find the one whose mean distance to the 1st and 2nd cluster centers is largest and judge whether its distances to both are > 2·d_c; if both are, select it as the 3rd cluster center;
(5) select the remaining cluster centers analogously: each time, exclude from the Δρ = 70% points those within radius d_c of every currently selected center, then find, among the points remaining in the top 70% by density, the one whose mean distance to all selected centers is largest; stop when the distance from the new maximum-mean point to some already-selected center is < 2·d_c, completing the initial cluster center selection.
Building on existing research results, the invention proposes the MM-HDC (max mean and high density connection) method, which finds initial cluster centers based on the maximum mean distance and fuses clusters based on high-density connection. First, Δρ = 70% is set as the candidate range for initial cluster centers, and the mean distance is introduced until the distance from the new maximum-mean point to some already-selected center is < 2·d_c, at which point initial center selection ends. Then the K-Means assignment strategy clusters all data points by their distance to each initial center; the centers are updated continuously with center shifts until the position change between new and old centers is small (i.e., a small distance), updating stops, and the last clustering is taken as the final result. Finally, center fusion following the idea of an iterative fusion method yields a better clustering result. Experimental results on classical data sets show that the MM-HDC algorithm outperforms the DPC and K-Means algorithms, and the improved density peak clustering algorithm attains higher accuracy. Furthermore, MM-HDC can yield satisfactory results on data sets with special shapes or non-uniform distributions.
Example 3
To effectively address the many shortcomings in the field of existing density peak clustering algorithms, the invention makes technical improvements from the following angles, achieving a marked classification effect, high accuracy, and strong practicality. Specifically:
aiming at the above problems, this patent improves density peak clustering and proposes a new clustering idea, the MM-HDC (max mean and high density connection) method, which finds initial cluster centers based on the maximum mean distance and fuses clusters based on high-density connection.
First, initial cluster centers are selected using the mean distance and the cutoff center;
then the K-Means assignment strategy clusters all data points by their distance to each initial cluster center;
the cluster centers are then updated continuously with center shifts until the position change between the new and old centers is small (i.e., a small distance), updating stops, and the last clustering is taken as the final clustering result. Finally, center fusion following the idea of an iterative fusion method yields a better clustering result.
The local density ρ_i and distance δ_i are defined as follows.
For any data point i, two quantities need to be calculated: the local density ρ_i of data point i and its distance δ_i to the nearest point of higher density:

ρ_i = Σ_{j≠i} χ(d_ij − d_c) (1)

wherein n is the number of data points, d_ij = dist(x_i, x_j) is the distance between x_i and x_j; χ is an indicator function defined by χ(x) = 1 when x < 0 and χ(x) = 0 otherwise; d_c is the cutoff radius. As can be seen, ρ_i equals the number of data points within the d_c-neighborhood of i, i.e., its density. δ_i is then measured as the minimum distance between point i and any point of higher density:

δ_i = min_{j: ρ_j > ρ_i}(d_ij) (2)

For the point of highest density, one can take δ_i = max_j(d_ij). Finally, the data points are sorted in descending order of density.
Gaussian kernel:

ρ_i = Σ_{j≠i} exp(−(d_ij/d_c)²) (3)

Comparing formulas (1) and (3), the Gaussian kernel is a continuous-valued computation, so the latter is less likely to produce collisions (i.e., different data points with the same local density). For data point x_i, the relative distance δ_i can be defined as:

δ_i = min_{j: ρ_j > ρ_i}(d_ij) if some ρ_j > ρ_i exists; otherwise δ_i = max_j(d_ij) (4)

As formula (4) shows, when x_i has the maximum local density, δ_i is the distance from x_i to the data point farthest from it in the data set; otherwise, δ_i is the distance from x_i to the nearest data point (or points) whose local density exceeds that of x_i. For each data point x_i in the data set, the local density ρ_i and relative distance δ_i are computed. Each point is then represented as a pair (ρ_i, δ_i) and drawn on a plane (with ρ_i on the horizontal axis and δ_i on the vertical axis), called the decision graph. The decision graph is the key by which the DPC algorithm selects cluster centers: the points toward the upper right, whose ρ_i and δ_i are both relatively large, satisfy the two characteristics of a cluster center. For data sets with complex decision graphs, selecting the correct cluster centers is difficult, and the DPC algorithm's one-step assignment of non-center points can trigger a chain reaction: once one data point is assigned incorrectly, a series of cluster errors follows in the sample data set. We therefore improve the selection of the initial cluster centers. This patent selects 3 evaluation indexes: the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.
Compared with the prior art, the experiments of the invention display the clustering result on the aggregation data set visually, one color per class; the parameters are set to percent = 1.8 and part = 0.7. The initial cluster centers are selected according to the strategy of the find_centers_auto() function, and the first clustering uses the K-Means "assign each point to its nearest center" strategy; FIG. 2 shows the first clustering result.
After each clustering round, the mean center of each cluster is computed as its new cluster center, and clustering is repeated with the K-Means nearest-center assignment strategy; FIG. 3 shows the result of the last clustering round.
After clustering finishes, center fusion is performed: the iterative fusion method iterates according to the run_concat() function strategy; there are 3 fusions in total, and FIG. 4 shows the result of the third fusion.
As can be seen from FIGS. 2-5: on the aggregation data set, the seventh clustering divides the data into ten classes; after center fusion with continuous iterative updating, the data are clustered well into 7 classes, giving a satisfactory clustering result.
The following experiments run the improved DPC algorithm, the K-Means algorithm, and the DPC algorithm on the following 4 data sets, with the corresponding optimal parameters given; bold marks the best result within each algorithm. The clustering results of each data set are displayed visually, one color per class. Table 1 compares the clustering results:
Table 1. Detailed experimental comparison
As the comparison results in Table 1 show, on the aggregation data set MM-HDC scores higher than the K-Means and DPC algorithms on the sc index; both MM-HDC and DPC obtain satisfactory clustering results, while the K-Means clustering result is not ideal. On the pathbased data set, MM-HDC clusters well into the 3 classes, but DPC does not divide the data set into 3 classes well: because DPC computes only local density and relative distance attributes, it cannot identify all classes on data sets with uneven distribution; we can also see that the Davies-Bouldin index of the MM-HDC algorithm is higher than those of the K-Means and DPC algorithms. On the synthetic data set, MM-HDC clusters well into 5 classes, achieving a satisfactory effect. On the flame data set, the DPC algorithm achieves a good effect, the K-Means clustering result is not ideal, and the MM-HDC algorithm is clearly better than K-Means.
Example 4
Example application
As shown in fig. 6, the clustering algorithm is an important unsupervised learning technique in machine learning and has been widely applied in various fields such as business, bioinformatics, image processing, social networking, e-commerce, and many others. The Iris data set is a classical real data set, often used as an example in both statistical learning and machine learning. The data set contains 150 rows of data in total, each row consisting of 4 characteristic values and one target value. The 4 characteristic values are: sepal length, sepal width, petal length, petal width; the target values are three different types of iris: Iris Setosa, Iris Versicolour, Iris Virginica. From these 4 features one can predict which of the 3 varieties an iris flower belongs to.
Selecting sepal length and petal length as features, the Iris data set is clustered with the MM-HDC clustering algorithm model; the final clustering result, obtained after 3 rounds of clustering, is shown in fig. 6. The parameter values of the algorithm are: cutoff distance percentage percent = 2.1, cutoff center percentage part = 0.8. For the improved DPC clustering algorithm proposed by the invention, the evaluation indexes are: sc: 0.5218, chi: 495.0828, dbi: 0.5631. It can be seen that the MM-HDC clustering algorithm proposed here gathers the iris data well into 3 categories, and the 3 classical clustering evaluation indexes, the silhouette coefficient (sc), the Calinski-Harabasz index (chi), and the Davies-Bouldin index (dbi), all behave well. The MM-HDC clustering algorithm, improved from the DPC clustering algorithm, is therefore also applicable to real data sets: in real life it can cluster flower varieties from the data representation of a flower's biological characteristics, which has practical significance.
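A usage sketch for reproducing this setup: the data loading and the evaluation indexes below use scikit-learn, which provides the Iris data and the sc/chi/dbi metrics; mm_hdc is a hypothetical wrapper for the MM-HDC steps sketched earlier, since no public implementation is referenced in the patent:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Load Iris and keep sepal length (column 0) and petal length (column 2),
# the two features used in this example.
iris = load_iris()
X = iris.data[:, [0, 2]]

# mm_hdc is a hypothetical wrapper around the steps sketched earlier
# (initial centers -> K-Means-style iteration -> center fusion), with
# parameters mirroring percent=2.1 and part=0.8 from this example:
# labels, centers = mm_hdc(X, dist_percent=2.1, part=0.8)

# With labels in hand, the three evaluation indexes would be computed as:
# sc  = silhouette_score(X, labels)
# chi = calinski_harabasz_score(X, labels)
# dbi = davies_bouldin_score(X, labels)
```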
Beneficial effects:
The invention provides a new clustering idea, the MM-HDC (max mean and high density connection) method, which finds initial clustering centers based on the maximum mean distance and fuses clusters based on high-density connection. First, initial clustering centers are selected using the mean distance and the cutoff center; then the K-Means assignment strategy clusters all data points by their distance to each initial cluster center; the cluster centers are then updated continuously with center shifts until the position change between the new and old cluster centers is small (i.e., a small distance), updating stops, and the last clustering is taken as the final clustering result. Finally, center fusion following the idea of an iterative fusion method yields a better clustering result.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (5)
1. An improved method of the DPC clustering algorithm, characterized in that the improved method performs cluster analysis on the Iris data set, wherein the data set contains 150 rows of data in total, each row consisting of 4 characteristic values and one target value, the 4 characteristic values being: sepal length, sepal width, petal length, petal width; and the target values being three different types of iris: Iris Setosa, Iris Versicolour, Iris Virginica, the method comprising:
s1, selecting an initial clustering center through the mean value distance and the cut-off center;
s2, clustering according to Euclidean distances from all data points to each initial clustering center by adopting a K-Means allocation strategy;
s3, updating the cluster center, performing center offset, reassigning attributions to all data points, and repeating the operation; when the Euclidean distance between two points of the new cluster center and the old cluster center is smaller than a set value, stopping updating the cluster center, and taking the last clustering result as a final clustering result;
s4, judging whether center fusion is needed or not between clusters obtained after updating the cluster center; if center fusion is needed, adopting the thought of an iterative fusion method to perform center fusion, and obtaining a new clustering result; if not, adopting the final clustering result in the S3;
the center fusion in S4 includes:
s511, traversing all cluster centers, judging whether the clusters are fused or not in pairs, stopping judging when two clusters to be fused are encountered, and returning the two cluster centers to be fused;
s512, solving the density of two cluster centers, taking the label of the cluster center with larger density in the two clusters as a fused cluster label, and returning a label distribution result after cluster fusion;
s513, binding the data set and the label again, and marking; solving the average value center of the same points of the cluster labels as a fusion cluster center;
s514, returning to the new cluster center set after fusion;
s515, carrying out pairwise cluster core fusion according to the obtained new cluster core iteration until the new cluster core can not be fused any more and the center fusion is finished;
judging whether center fusion is needed between clusters or not comprises the following steps:
s521, taking the straight line distance between two cluster centers as the diameter, taking the midpoint of the straight line distance between the two cluster centers as the center of a circle, and finding out the data points respectively belonging to the two clusters in the circle;
s522, finding out paired pseudo core data points, which specifically comprises the following steps: calculating the distance between data points respectively belonging to two clusters in a circle, and finding out paired points with the distance between the two data points smaller than the truncated radius, namely paired pseudo-core data points; if no paired pseudo core data points exist, the two clusters cannot be fused;
the cutoff radius d_c is calculated as:

d_c = maxDist × distPercent / 100 (1)

wherein maxDist is the maximum value in the distance vector distList; distPercent is a cutoff percentage, whose value is adjusted according to the characteristics of each data set;
s523, finding out paired true core data points, which specifically comprises the following steps: finding two paired data points with the density being greater than the minimum density from paired pseudo-core data points, namely paired true-core data points; if no pair of true core data points exist, the two clusters cannot be fused; the minimum density is a manually set value;
s524, judging whether the paired true core data points and the two cluster centers are high-density connected,
high-density connection comprising:
dividing the distance from each true core data point to the cluster center on its side by the cutoff radius to obtain the high-density threshold for that side; for each side, taking the straight-line distance between the true core data point and the cluster center on the same side as a diameter and its midpoint as the circle center, drawing a circle and counting the high-density data points inside it; if the number of high-density points is at least the side's high-density threshold, and the maximum local density in the circle is no more than twice the minimum local density, that side is high-density connected, otherwise it is not;
when both sides are high-density connected, the two clusters can be fused into one cluster; if either side (or both) fails the high-density connection test, the two clusters cannot be fused.
2. The improvement of DPC clustering algorithm according to claim 1, wherein the continuously updating cluster center in S3 includes:
adopting the K-Means cluster-center update strategy: computing the mean of all data points in each cluster as the new cluster center of the current cluster.
3. The improvement to a DPC clustering algorithm of claim 1, wherein the reassigning attributes to all data points in S3 comprises:
reassigning attributions to the data points according to the new cluster centers; and calculating the distance from all the data points to each new clustering center by adopting a K-Means distribution strategy, and re-clustering the data points according to the distance.
4. The improved method of the DPC clustering algorithm according to claim 1, wherein selecting the initial cluster centers in S1 by using the mean distance and a narrowed candidate range comprises the following steps (a code sketch follows these steps):
S11, narrowing the candidate range for initial cluster centers: calculating the density of all data points, setting a minimum density value, and keeping only the data points whose density exceeds it;
S12, selecting the 1st cluster center: taking the data point with the maximum density as the 1st cluster center;
S13, selecting the 2nd cluster center: excluding the data points within the cutoff radius of the 1st cluster center, selecting from the remaining data points the one farthest from the 1st cluster center, and judging whether its distance to the 1st cluster center is greater than twice the cutoff radius; if so, it becomes the 2nd cluster center; otherwise it is not selected and initial cluster-center selection ends;
S14, selecting the 3rd cluster center: after the 2nd cluster center is selected, excluding the data points within the cutoff radius of the 1st and 2nd cluster centers, finding among the remaining data points the one with the maximum mean distance to the 1st and 2nd cluster centers, and judging whether its distances to both centers are greater than twice the cutoff radius; if so, it becomes the 3rd cluster center; otherwise it is not selected and initial cluster-center selection ends;
repeating step S14 until the distance from the new mean-distance candidate to an already-selected cluster center is smaller than twice the cutoff radius, at which point initial cluster-center selection stops.
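A compact sketch of the selection loop in S11–S14 (function and variable names are illustrative; the density vector and cutoff radius are assumed to be precomputed):

```python
import numpy as np

def select_initial_centers(points, density, d_c, min_density):
    """Sketch of S11-S14: pick well-separated, dense initial centers."""
    # S11: restrict candidates to points denser than the minimum density
    candidates = set(np.flatnonzero(density > min_density))
    # S12: the densest point becomes the 1st center
    centers = [max(candidates, key=lambda i: density[i])]
    while True:
        # S13/S14: exclude candidates within d_c of any chosen center
        pool = [i for i in candidates
                if all(np.linalg.norm(points[i] - points[c]) > d_c
                       for c in centers)]
        if not pool:
            break
        # candidate with the largest mean distance to the chosen centers
        best = max(pool, key=lambda i: np.mean(
            [np.linalg.norm(points[i] - points[c]) for c in centers]))
        # accept only if it is more than twice d_c from every chosen center
        if all(np.linalg.norm(points[best] - points[c]) > 2 * d_c
               for c in centers):
            centers.append(best)
        else:
            break   # stopping condition of S13/S14
    return [points[c] for c in centers]
```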
5. The improved method of the DPC clustering algorithm according to claim 1, wherein the local density is calculated by formula (2):
ρ_i = Σ_{j=1, j≠i}^{n} χ(d_ij − d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise  (2)
wherein ρ_i represents the local density of data point x_i; n represents the number of data points; d_ij = dist(x_i, x_j) represents the distance between data points x_i and x_j; and d_c represents the cutoff radius.
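A minimal NumPy sketch of formula (2), the cutoff-kernel density (function name illustrative; Euclidean distances assumed):

```python
import numpy as np

def local_density(points, d_c):
    """Cutoff-kernel density: rho_i counts the neighbors within d_c of x_i."""
    diffs = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    return (d < d_c).sum(axis=1) - 1   # minus 1 excludes the point itself
```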
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111080561.3A (CN113780437B) | 2021-09-15 | 2021-09-15 | Improved method of DPC clustering algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113780437A (en) | 2021-12-10 |
CN113780437B (en) | 2024-04-05 |
Family
ID=78844005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111080561.3A (CN113780437B, Active) | Improved method of DPC clustering algorithm | 2021-09-15 | 2021-09-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113780437B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114399004A (en) * | 2022-02-25 | 2022-04-26 | 上海图灵智算量子科技有限公司 | Mixed clustering method for simulating bifurcation and brain heuristic cognition |
CN114896393B (en) * | 2022-04-15 | 2023-06-27 | 中国电子科技集团公司第十研究所 | Data-driven text increment clustering method |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8355998B1 (en) * | 2009-02-19 | 2013-01-15 | Amir Averbuch | Clustering and classification via localized diffusion folders |
KR20110096236A (en) * | 2010-02-22 | 2011-08-30 | 숭실대학교산학협력단 | Apparatus and method for clustering using mutual information between clusters |
CN109840558A (en) * | 2019-01-25 | 2019-06-04 | 南京航空航天大学 | Based on density peaks-core integration adaptive clustering scheme |
CN111914930A (en) * | 2020-07-30 | 2020-11-10 | 上海工程技术大学 | Density peak value clustering method based on self-adaptive micro-cluster fusion |
CN113344019A (en) * | 2021-01-20 | 2021-09-03 | 昆明理工大学 | K-means algorithm for improving decision value selection initial clustering center |
Non-Patent Citations (2)
Title |
---|
Density peaks clustering algorithm based on K-nearest neighbors; Zeng Jiahao; China International Finance (Chinese and English); 2018-03-08 (Issue 05); full text *
K-means algorithm based on an improved density peaks algorithm; Du Hongbo; Bai Azhen; Zhu Lijun; Statistics & Decision; 2018-09-30 (Issue 18); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113780437B (en) | Improved method of DPC clustering algorithm | |
CN111814251B (en) | Multi-target multi-modal particle swarm optimization method based on Bayesian adaptive resonance | |
CN105930862A (en) | Density peak clustering algorithm based on density adaptive distance | |
CN104217015B (en) | Based on the hierarchy clustering method for sharing arest neighbors each other | |
CN111553127A (en) | Multi-label text data feature selection method and device | |
CN111626321B (en) | Image data clustering method and device | |
CN113344019A (en) | K-means algorithm for improving decision value selection initial clustering center | |
CN111401785A (en) | Power system equipment fault early warning method based on fuzzy association rule | |
Latifi-Pakdehi et al. | DBHC: A DBSCAN-based hierarchical clustering algorithm | |
CN110348488A (en) | A kind of modal identification method based on local density's peak value cluster | |
CN111079788A (en) | K-means clustering method based on density Canopy | |
CN111291822A (en) | Equipment running state judgment method based on fuzzy clustering optimal k value selection algorithm | |
CN113435108A (en) | Battlefield target grouping method based on improved whale optimization algorithm | |
CN117407732A (en) | Unconventional reservoir gas well yield prediction method based on antagonistic neural network | |
CN107704872A (en) | A kind of K means based on relatively most discrete dimension segmentation cluster initial center choosing method | |
CN111814979B (en) | Fuzzy set automatic dividing method based on dynamic programming | |
Saha et al. | Performance evaluation of some symmetry-based cluster validity indexes | |
CN111914930A (en) | Density peak value clustering method based on self-adaptive micro-cluster fusion | |
CN112801197A (en) | K-means method based on user data distribution | |
Kliegr | Quantitative CBA: Small and Comprehensible Association Rule Classification Models | |
CN112308160A (en) | K-means clustering artificial intelligence optimization algorithm | |
Banka et al. | Feature selection and classification for gene expression data using evolutionary computation | |
CN112951438A (en) | Outlier detection method based on noise threshold distance measurement | |
Saha et al. | A new multiobjective simulated annealing based clustering technique using stability and symmetry | |
CN111046914B (en) | Semi-supervised classification method based on dynamic composition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||