CN111310809A - Data clustering method and device, computer equipment and storage medium

Info

Publication number: CN111310809A
Application number: CN202010079585.6A
Authority: CN
Inventors: 于会; 陈芦园; 王星南; 张洁; 董文敏; 杨海泽
Original assignee: Chongqing Yichuang Northwest Industrial Technology Research Institute Co ltd
Current assignee: Chongqing Yichuang Northwest Industrial Technology Research Institute Co ltd
Priority date: 2020-02-04
Filing date: 2020-02-04
Publication date: 2020-06-19

Abstract

The invention belongs to the technical field of computers and provides a data clustering method, a device, computer equipment and a storage medium. The data clustering method comprises the following steps: processing a plurality of clustering center data respectively to generate a plurality of corresponding core data sets; processing the remaining boundary data, determining the core data set associated with each boundary data, and adding each boundary data to its corresponding core data set; and clustering and dividing the data to be clustered according to the core data sets. The method first divides the data most likely to belong to the same cluster into a core data set, and then determines the relevance between the remaining data and that set. Compared with the prior art, which determines the relevance between the remaining data and individual already-divided data, a wrongly divided point in the set does not seriously affect the division of the remaining boundary data, which addresses the technical problem that existing clustering algorithms depend on the accuracy of the labels.

Description

Data clustering method and device, computer equipment and storage medium

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a data clustering method, a data clustering device, computer equipment and a storage medium.

Background

Clustering is the process of dividing a set of physical or abstract objects into multiple clusters (or classes) of similar objects, so that objects in the same cluster are highly similar and objects in different clusters are highly dissimilar. Many clustering algorithms exist, such as the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise), the DPCA algorithm (density peak clustering algorithm) and the KNN algorithm (k-nearest neighbor clustering algorithm); among them, the DPCA algorithm is one of the more commonly used.

However, in the final stage of the DPCA algorithm, each remaining non-center point is assigned to the cluster of the most similar point whose density is higher than its own. This creates a serious label dependence problem: once a point is assigned to an incorrect cluster during this process, points of lower density may be assigned to the same incorrect cluster, producing a chain reaction that degrades the clustering result.

Therefore, the existing DPCA algorithm has the technical problems of high dependency on labels and unsatisfactory clustering results.

Disclosure of Invention

An embodiment of the invention aims to provide a data clustering method that solves the technical problems that the existing DPCA algorithm has high dependency on labels and gives unsatisfactory results.

The embodiment of the invention is realized in such a way that a data clustering method comprises the following steps:

according to a preset core data determination rule, respectively processing a plurality of clustering center data to generate a plurality of core data sets corresponding to the clustering center data; the plurality of clustering center data are determined by processing data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;

processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data; the boundary data refers to data to be clustered which are not contained in any core data set;

and clustering and dividing the data to be clustered according to the core data set.

Another object of an embodiment of the present invention is to provide a data clustering apparatus, including:

the core data determining module is used for respectively processing a plurality of determined clustering center data according to a preset core data determination rule to generate a plurality of core data sets corresponding to the clustering center data, wherein the plurality of clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;

the boundary data processing module is used for processing boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data, wherein the boundary data refers to data to be clustered which is not contained in any core data set;

and the data clustering module is used for clustering and dividing the data to be clustered according to the core data set.

It is a further object of an embodiment of the present invention to provide a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the data clustering method as described above.

It is a further object of an embodiment of the present invention to provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, causes the processor to perform the steps of the data clustering method as described above.

According to the data clustering method provided by the embodiment of the invention, a plurality of clustering center data are respectively processed according to a preset core data determination rule to obtain a plurality of core data sets corresponding to the clustering center data. The clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold. After the core data sets are determined, the remaining data, namely the boundary data, are respectively divided into the core data sets to which they may belong. Each core data set can be understood as a cluster, so the clustering of the data to be clustered follows directly from the core data sets. In other words, the method first determines, through the preset core data determination rule, core data sets that each take one clustering center data as the core and comprise a plurality of core data, and each core data set satisfies the following: for any data in the core data set, all data whose distance from that data is smaller than the preset core distance threshold are also in the core data set. All data in each core data set are therefore close enough to have a very high probability of belonging to the same cluster. The remaining boundary data are then divided into the clusters to which they may belong.

Drawings

FIG. 1 is a flowchart illustrating steps of a data clustering method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps for determining cluster center data according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating another step of determining cluster center data according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating the steps of processing a plurality of cluster center data according to a preset core data determination rule according to an embodiment of the present invention;

FIG. 5 is a flowchart of the steps of processing boundary data according to a preset boundary data processing rule to respectively determine core data sets associated with the boundary data according to an embodiment of the present invention;

FIG. 6 is another flowchart of the steps of processing boundary data according to a preset boundary data processing rule to respectively determine core data sets associated with the boundary data according to an embodiment of the present invention;

FIG. 7 is yet another flowchart of the steps of processing boundary data according to a preset boundary data processing rule to respectively determine core data sets associated with the boundary data according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a data clustering device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

First, to facilitate understanding of the present invention, several technical terms and calculation formulas are explained before the detailed discussion of each step, as follows:

① K-neighbor data:

For a certain data p, its K-neighbor data $KNN_p$ is expressed as:

$$KNN_p = \{\, q \mid q \neq p,\ d_{pq} \le d_p^{\max} \,\}$$

where $d_{pq}$ is the Euclidean distance between the samples p and q (its calculation is a routine technical means for those skilled in the art and is not described here), and $d_p^{\max}$ represents the maximum distance between the sample p and its K nearest neighbors, K being a preset value. In short, the K-neighbor data of the data p is the set (excluding p) of the K data closest to the data p.
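The K-neighbor data described in item ① can be illustrated with a minimal Python sketch. The function name `k_neighbors`, the brute-force search and the sample points are assumptions made for illustration only; ties at the K-th distance are broken by index order here, which the text does not specify.

```python
import math

def k_neighbors(points, p, k):
    """Return the indices of the k points closest to points[p], excluding p itself."""
    others = [q for q in range(len(points)) if q != p]
    # Sort the candidate neighbours by Euclidean distance d_pq.
    others.sort(key=lambda q: math.dist(points[p], points[q]))
    return others[:k]

# Hypothetical sample data: four 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (10.0, 10.0)]
print(k_neighbors(points, 0, 2))
```

A brute-force search costs O(n) distance evaluations per query; for large data sets a spatial index would normally replace it.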

② local density:

For a certain data p, its local density $\rho_p$ is expressed as:

$$\rho_p = \sum_{q \in KNN_p} \exp(-d_{pq})$$

where $d_{pq}$ and $KNN_p$ have the meanings given in item ① and are not repeated here, and exp denotes the exponential function with base e. In short, the local density of the data p is the sum, over the K-neighbor data of p, of exponential functions whose exponent is the negative of the Euclidean distance between that neighbor and p.
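A minimal sketch of the local density in item ② follows. It assumes that the exponent is the negative Euclidean distance to each K-neighbor, so that closer neighbors contribute more; the function name `local_density` and the sample points are hypothetical.

```python
import math

def local_density(points, p, k):
    """Sum exp(-d_pq) over the k nearest neighbours q of points[p]."""
    dists = sorted(
        math.dist(points[p], points[q])
        for q in range(len(points)) if q != p
    )
    # Only the k smallest distances (the K-neighbor data) contribute.
    return sum(math.exp(-d) for d in dists[:k])

# Hypothetical sample data: point 0 sits in a denser region than point 3.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (10.0, 10.0)]
rho = [local_density(points, p, 2) for p in range(len(points))]
print(rho)
```

Under this assumption an isolated point receives a density close to zero, while a point with K very close neighbors approaches K.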

③ distance from high density point:

For a certain data p, its distance $\delta_p$ from the high-density point is expressed as:

$$\delta_p = \begin{cases} \min\limits_{q:\ \rho_p < \rho_q} d_{pq}, & \text{if some } q \text{ has } \rho_q > \rho_p \\ \max\limits_{q} d_{pq}, & \text{otherwise} \end{cases}$$

where $q: \rho_p < \rho_q$ denotes all data q whose local density is higher than that of the data p; see item ② for the definition of local density. In short, the distance between the data p and the high-density point is the distance from p to the closest of all points whose local density is higher than that of p. When the local density of the data p is the highest, the distance is instead defined as the distance between p and the data farthest from p.
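The two cases of item ③ can be sketched as follows. The function name `delta_distances` is hypothetical, and the local densities `rho` are assumed to have been computed beforehand; the sample values are made up for illustration.

```python
import math

def delta_distances(points, rho):
    """For each point, distance to the nearest point of strictly higher density."""
    n = len(points)
    delta = []
    for p in range(n):
        higher = [math.dist(points[p], points[q])
                  for q in range(n) if rho[q] > rho[p]]
        if higher:
            delta.append(min(higher))
        else:
            # p has the highest density: use the distance to the farthest point.
            delta.append(max(math.dist(points[p], points[q])
                             for q in range(n) if q != p))
    return delta

# Hypothetical sample data: three collinear points with decreasing density.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
rho = [3.0, 2.0, 1.0]
print(delta_distances(points, rho))
```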

④ middle distance:

For a certain data p, the middle distance $\bar{d}_p$ is expressed as:

$$\bar{d}_p = \frac{d_p^{\max} + d_p^{\min}}{2}$$

where $d_p^{\max}$ represents the maximum distance between the sample p and its K nearest neighbors, and $d_p^{\min}$ represents the minimum distance between the sample p and its K nearest neighbors. In short, the middle distance of the data p is the average of the maximum distance and the minimum distance between p and its K nearest neighbors.
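As a small illustration of item ④, the middle distance is the mean of the smallest and largest of the K neighbor distances. The function name `middle_distance` and the sample points below are assumptions made for this sketch.

```python
import math

def middle_distance(points, p, k):
    """Average of the minimum and maximum distance from points[p] to its k nearest neighbours."""
    dists = sorted(
        math.dist(points[p], points[q])
        for q in range(len(points)) if q != p
    )[:k]
    return (dists[0] + dists[-1]) / 2

# Hypothetical sample data: the two nearest neighbours of point 0 lie at distances 1 and 3.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (10.0, 10.0)]
print(middle_distance(points, 0, 2))
```

Because it is computed per point, this value adapts to the local spacing of the data, which is presumably why the text later prefers it as the core distance threshold.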

⑤ support coefficient:

For any data p and one of its K-neighbor data $q_j \in KNN_p$, the support coefficient of $q_j$ for p

is recorded as:

wherein one term can be understood as the inverse of the distance between two data, and the other as the sum of the inverses of the distances between all K-neighbor data of data $q_j$ and data $q_i$.

⑥ quality function:

For any data p and one of its K-neighbor data $q_j \in KNN_p$: if $q_j$ has been assigned and the label of its cluster is $C_i$, then $\chi(q_j) = 1$; if $q_j$ has not been assigned, then $\chi(q_j) = 0$. The quality function (mass function) of $q_j$ with respect to p

is recorded as:

where Θ denotes the set of all clusters, i.e., $\Theta = \{C_1, C_2, \ldots, C_k\}$. That is, for each sample data p, each of its K-neighbor data constructs one quality function for p, so p corresponds to K quality functions. In theory, based on the fusion rule, the K quality functions can be fused into one, as described in detail below.

⑦ fusion quality function:

The K quality functions of data p are fused to obtain the fused quality function $m^p$ of p. The specific fusion calculation formula is as follows:

wherein

⑧ probability conversion:

For the fused quality function of data p, taking i = 1, 2, ..., k in turn, the probability that data p belongs to cluster $C_i$ can be determined.

These probabilities together form the distribution probability of the data p over the clusters.

Items ⑤ to ⑧ are mainly applied in the steps shown in FIG. 5.

As shown in fig. 1, a flowchart of steps of a data clustering method provided in an embodiment of the present invention specifically includes the following steps:

Step S102, according to a preset core data determination rule, processing a plurality of clustering center data respectively to generate a plurality of core data sets corresponding to the clustering center data.

In the embodiment of the present invention, the plurality of cluster center data are determined by processing the data to be clustered through a density peak value clustering algorithm (DPCA algorithm), that is, the clustering algorithm provided by the present invention can be understood as an improved DPCA algorithm, and the specific content of the improvement lies in the processing after the cluster center data are determined. For a specific process of determining cluster center data, please refer to the following fig. 2 and the description thereof.

In an embodiment of the present invention, the core data set includes first core data and all data whose distance from the first core data is smaller than a preset core distance threshold. The first core data may be any data in the core data set, that is, for any data in the core data set, all data satisfying a distance smaller than a preset core distance threshold from the first core data are also located in the core data set.

In the embodiment of the present invention, since for any data in the core data set all data whose distance from that data is smaller than the preset core distance threshold are also located in the core data set, the data in the core data set can be understood as sufficiently close to one another, that is, they have a very high probability of being located in the same cluster. This is the effect to be achieved by the preset core data determination rule of the present invention. There are numerous ways to generate core data sets whose members have a very high probability of being located in the same cluster; claim 1 of the present invention places no limitation on the specific implementation, and any core data determination rule that can achieve the above effect is within the scope claimed by the present invention. Furthermore, the present invention also provides a specific core data determination rule; refer to claim 4 and its explanation.

As a preferred embodiment of the present invention, the core distance threshold is selected from a middle distance or a distance related to the middle distance, wherein for the definition of the middle distance of the data, please refer to the aforementioned point ④.

Step S104, processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into the core data set corresponding to the boundary data.

In the embodiment of the present invention, the boundary data refers to data to be clustered which is not included in any core data set.

In the embodiment of the present invention, the data not added to any core data set in step S102 cannot be said with certainty to belong to a particular core data set. Therefore, the boundary data need to be further processed with a corresponding processing rule and then added to the corresponding core data set. The invention does not limit the specific boundary data processing rule; any rule or algorithm capable of establishing a correspondence between the boundary data and the core data sets is feasible. That is, the invention classifies the boundary data based on a cluster consisting of a plurality of data points, which addresses the technical problem in the prior art that the classification of each remaining non-center point depends on the cluster of a single other point, making the clustering result insufficiently stable.

In the embodiment of the present invention, a specific boundary data processing rule is further provided, and refer to fig. 5 and the explanation thereof.

Step S106, performing cluster division on the data to be clustered according to the core data set.

In the embodiment of the present invention, the set of data added to a core data set in step S102 is denoted as the positive domain POS(C_i) of the cluster C_i; it can be understood that these data belong to the cluster exactly or with a high probability. The set of data added to a core data set in step S104 is denoted as the boundary domain BND(C_i) of the cluster C_i; although these data also belong to the cluster, they are distributed in the edge region of the cluster and may also be members of other clusters. That is, the true classification of a cluster lies between the positive domain of the cluster and the union of its positive and boundary domains, i.e., C_i′ = [POS(C_i), POS(C_i) ∪ BND(C_i)].
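The interval representation of a cluster described above can be pictured with a small sketch. The function name `three_way_cluster`, the dictionary keys and the sample members are hypothetical and only illustrate the relation between POS(C_i), BND(C_i) and their union.

```python
def three_way_cluster(core_set, boundary_members):
    """Represent one cluster by its positive domain and its boundary domain."""
    pos = set(core_set)                  # POS(C_i): data placed in step S102
    bnd = set(boundary_members) - pos    # BND(C_i): data attached in step S104
    # The true cluster lies between POS and POS ∪ BND.
    return {"POS": pos, "BND": bnd, "UPPER": pos | bnd}

# Hypothetical members: A-D form the core data set, E and F are boundary data.
cluster = three_way_cluster({"A", "B", "C", "D"}, {"E", "F"})
print(sorted(cluster["UPPER"]))
```

Keeping the two domains separate preserves the information that boundary members may also belong to other clusters.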

As shown in fig. 2, a flowchart of the step of determining cluster center data provided in the embodiment of the present invention specifically includes the following steps:

Step S202, calculating the local density of each data to be clustered and the distance between each data to be clustered and the high-density point.

In the embodiment of the present invention, for the explanation of the local density and of the distance from the high-density point, and for the specific calculation formulas, please refer to items ② and ③, which are not repeated herein.

Step S204, determining all data to be clustered, of which the local density is greater than a preset first density threshold and the distance between the data to be clustered and the high-density point is greater than a preset distance threshold, as clustering center data.

In the embodiment of the present invention, the definition of the cluster center data in the DPCA algorithm is that the data point has both a higher local density and a larger distance from the high-density point, and therefore, by presetting a density threshold and a distance threshold, the data point whose local density is higher than the density threshold and whose distance from the high-density point is higher than the distance threshold is directly determined as the cluster center data.
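Step S204 amounts to a filter on two thresholds, as the following sketch shows. The function name `find_cluster_centers`, the threshold values and the sample densities and distances are assumptions for illustration; `rho` and `delta` are taken as already computed in step S202.

```python
def find_cluster_centers(rho, delta, density_threshold, distance_threshold):
    """Indices whose local density and distance from the high-density point both exceed their thresholds."""
    return [
        p for p in range(len(rho))
        if rho[p] > density_threshold and delta[p] > distance_threshold
    ]

# Hypothetical values: points 0 and 2 are dense and far from any denser point.
rho = [3.0, 2.0, 2.8, 0.2]
delta = [5.0, 1.0, 4.0, 6.0]
print(find_cluster_centers(rho, delta, density_threshold=2.5, distance_threshold=3.0))
```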

As shown in fig. 3, another flow chart of the step of determining cluster center data provided in the embodiment of the present invention is different from the flow chart of the step of determining cluster center data shown in fig. 2, and further includes:

Step S302, determining all data to be clustered whose local density is smaller than a preset second density threshold and whose distance from the high-density point is greater than a preset distance threshold as noise data.

In the embodiment of the invention, considering the situation that the clustering data may have wrong data, namely noise points, caused by statistical errors or other factors, if the data are not removed during clustering, the clustering effect is influenced. Since the noise points are usually isolated and located at the periphery, that is, have a smaller local density and a larger distance from the high-density points, all the data to be clustered, which have a local density smaller than a preset second density threshold and a distance from the high-density points larger than a preset distance threshold, are determined as the noise data. And in the subsequent data processing process, noise data can be removed from the data set to be clustered so as to improve the clustering effect.
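The noise rule of step S302 mirrors the center rule with the density comparison reversed. In this sketch the function name `find_noise`, the thresholds and the sample values are hypothetical.

```python
def find_noise(rho, delta, second_density_threshold, distance_threshold):
    """Indices with low local density but a large distance from the high-density point."""
    return [
        p for p in range(len(rho))
        if rho[p] < second_density_threshold and delta[p] > distance_threshold
    ]

# Hypothetical values: point 3 is sparse and isolated.
rho = [3.0, 2.0, 2.8, 0.2]
delta = [5.0, 1.0, 4.0, 6.0]
noise = find_noise(rho, delta, second_density_threshold=0.5, distance_threshold=3.0)
# Remove the noise data before the subsequent clustering steps.
kept = [p for p in range(len(rho)) if p not in noise]
print(noise, kept)
```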

As shown in fig. 4, a flowchart of the step of processing the plurality of cluster center data according to the preset core data determination rule according to the embodiment of the present invention specifically includes the following steps:

Step S402, processing the first data added into the core data set associated with the cluster center data based on a core distance algorithm, and determining second data meeting a preset processing condition.

In the embodiment of the present invention, in the first pass, the first data added to the core data set associated with the cluster center data is the cluster center data itself.

In the embodiment of the present invention, processing the data to be processed based on the core distance algorithm specifically includes: determining core data from the K nearest neighbor data of the data to be processed, where core data refers to data whose distance from the data to be processed is smaller than a preset core distance threshold. Here, the first data is simply taken as the data to be processed.

In this embodiment of the present invention, the second data meeting the preset processing condition refers to core data that is not in the core data set.

Step S404, adding the second data into the core data set, and determining the second data as the first data added into the core data set associated with the cluster center data.

In this embodiment of the present invention, after the second data is added to the core data set and determined as first data, the process returns to step S402. Steps S402 and S404 can thus be understood as a loop, which ends when a pass no longer yields any second data meeting the preset processing condition. At that point, the core data set satisfies the condition that it includes the first core data and all data whose distance from the first core data is smaller than the preset core distance threshold.

In the embodiment of the present invention, in order to facilitate understanding of the process of processing the plurality of cluster center data respectively provided by the present invention, a simple example will be explained below.

Consider cluster center data point A and other data to be clustered B, C, D, E, F, G, etc. First, cluster center data point A is added to the core data set associated with point A and determined as first data. Based on the core distance algorithm (see the above description for the calculation process), the core data of A is determined to include B and C. Since B and C are not yet in the core data set associated with point A, they are added to it and determined as first data. Applying the core distance algorithm to B and C respectively, the core data of B is determined to include A and C, and the core data of C to include A and D. Since A and C are already in the core data set associated with point A, only D is second data satisfying the preset processing condition; D is added to the core data set associated with point A and processed in turn. When the core data of D is finally determined to include B and C, no second data satisfying the preset processing condition remains and the loop ends. At this point, the core data set associated with point A includes A, B, C and D.
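The loop of steps S402 and S404 can be sketched as a frontier expansion. This is a minimal illustration, not the claimed implementation: the function name `expand_core_set` is hypothetical, and a single fixed `threshold` is assumed for simplicity, whereas the text prefers a threshold derived from the middle distance.

```python
import math

def expand_core_set(points, center, k, threshold):
    """Grow the core data set of one cluster center by repeated core-distance checks."""
    core = {center}
    frontier = [center]          # "first data" still waiting to be processed
    while frontier:
        p = frontier.pop()
        # K nearest neighbours of the data being processed.
        neighbours = sorted(
            (q for q in range(len(points)) if q != p),
            key=lambda q: math.dist(points[p], points[q]),
        )[:k]
        for q in neighbours:
            # "Second data": core data of p that is not yet in the set.
            if math.dist(points[p], points[q]) < threshold and q not in core:
                core.add(q)
                frontier.append(q)
    return core

# Hypothetical sample data: four evenly spaced points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(sorted(expand_core_set(points, center=0, k=2, threshold=1.5)))
```

The loop terminates because every point enters the frontier at most once; the distant point is never reached and would remain boundary data.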

As shown in fig. 5, a flowchart of the steps of processing boundary data according to a preset boundary data processing rule and determining a core data set associated with each boundary data respectively according to an embodiment of the present invention specifically includes the following steps:

Step S502, determining K nearest neighbor data of the first boundary data.

In the embodiment of the present invention, all boundary data need to be processed, that is, the K nearest neighbor data of each boundary data must be determined; for simplicity of description, the first boundary data stands for each boundary data.

Step S504, determining the probability that the first boundary data belongs to each core data set according to the associated core data sets of the K nearest neighbor data and the distances between the K nearest neighbor data and the first boundary data.

In the embodiment of the present invention, the specific calculation process sequentially includes:

determining support coefficients of the K nearest neighbor data of the first boundary data to the first boundary data according to the associated core data set of the K nearest neighbor data and the distance between the K nearest neighbor data and the first boundary data, as explained in the aforementioned item ⑤;

determining a quality function of the K nearest neighbor data of the first boundary data to the first boundary data according to the support coefficients of the K nearest neighbor data of the first boundary data to the first boundary data and the core data set of the K nearest neighbor data, which is explained in the aforementioned item ⑥;

fusing the K nearest neighbor data of the first boundary data with the quality function of the first boundary data, as explained in the aforementioned item ⑦;

the probability distribution of the first boundary data for each cluster (i.e., core data set) is determined according to the fused quality function, as explained in the aforementioned item ⑧.

Step S506, determining whether the probability that any boundary data belongs to each core data set is not less than a preset allocation threshold. If yes, go to step S508; if not, go to step S510.

In the embodiment of the present invention, a probability distribution may be obtained for each boundary data, a matrix may be determined by combining the probability distributions obtained for each boundary data, and when all data in the matrix are smaller than a preset allocation threshold, it indicates that each boundary data is already divided into clusters to which the boundary data most probably belongs at this time, and the loop is ended. When certain data in the matrix is not smaller than the preset distribution threshold, it indicates that there may be certain boundary data that is not divided into the most probable clusters at this time, and further adjustment is needed, and considering that each boundary data is affected by its K-neighbor data, all boundary data need to be adjusted again.

Step S508, re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set, and returning to the step S504.

In the embodiment of the present invention, the associated core data set of the first boundary data is re-determined according to the probability that each data belongs to each core data set, and then the process returns to step S504 to perform a loop.

Step S510, determining an associated core data set of each boundary data.

In the embodiment of the present invention, the determined associated core data set of each piece of boundary data is the core data set to which each piece of boundary data most likely belongs.
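The overall shape of the loop in steps S502 to S510 can be sketched as follows. Note the substitution: this sketch uses a plain inverse-distance-weighted vote among the labelled K nearest neighbors in place of the support coefficients, quality functions and fusion of items ⑤ to ⑧, and it stops when no label changes rather than on the allocation threshold. The function name `assign_boundary`, the `max_iter` cap and the sample data are hypothetical.

```python
import math

def assign_boundary(points, core_labels, k, max_iter=20):
    """Iteratively attach boundary points to a core data set using their K nearest neighbours."""
    labels = dict(core_labels)                       # index -> cluster label
    boundary = [p for p in range(len(points)) if p not in labels]
    for _ in range(max_iter):
        changed = False
        for p in boundary:
            neighbours = sorted(
                (q for q in range(len(points)) if q != p),
                key=lambda q: math.dist(points[p], points[q]),
            )[:k]
            votes = {}
            for q in neighbours:
                if q in labels:                      # unassigned neighbours carry no weight
                    w = 1.0 / (math.dist(points[p], points[q]) + 1e-12)
                    votes[labels[q]] = votes.get(labels[q], 0.0) + w
            if votes:
                total = sum(votes.values())
                probs = {c: v / total for c, v in votes.items()}
                best = max(probs, key=probs.get)
                if labels.get(p) != best:
                    labels[p] = best
                    changed = True
        if not changed:                              # assignments are stable
            break
    return labels

# Hypothetical sample data: two core data sets and one boundary point (index 6).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (8.0, 0.0), (9.0, 0.0), (10.0, 0.0), (3.0, 0.0)]
core_labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
result = assign_boundary(points, core_labels, k=3)
print(result[6])
```

Because every boundary point is re-evaluated in each pass, a change in one point's label can propagate to its neighbors, which matches the re-adjustment of all boundary data described for step S506.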

FIG. 6 shows another flowchart of the steps of processing boundary data according to a preset boundary data processing rule and respectively determining the core data set associated with each boundary data, provided in the embodiment of the present invention. It differs from the flowchart shown in FIG. 5 in that:

in the step S508, the associated core data set of the first boundary data is re-determined according to the probability that the first boundary data belongs to each core data set, and the step S504 is returned to, specifically:

Step S602, associating the core data set corresponding to the maximum value of the probability that the first boundary data belongs to each core data set with the first boundary data.

In the embodiment of the present invention, the distribution probability of the data reflects the probability that the data belongs to the corresponding cluster: the greater the distribution probability of the data for a certain cluster, the more likely the data belongs to that cluster. The boundary data may therefore be divided into the corresponding core data set, that is, the core data set corresponding to the maximum value of the probability that the first boundary data belongs to each core data set is associated with the first boundary data.

As shown in fig. 7, a flowchart of still another step for processing boundary data according to a preset boundary data processing rule and determining a core data set associated with each boundary data respectively provided in the embodiment of the present invention is different from the flowchart of the step shown in fig. 6 in that:

in step S508, after re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set, the method further includes:

Step S702, determining whether an overlapped core data set exists. If yes, step S704 is performed; if not, other steps are executed.

In this embodiment of the present invention, an overlapped core data set is a core data set, other than the one corresponding to the maximum value of the probability, for which the difference between the maximum probability and the probability that the first boundary data belongs to that core data set is smaller than a preset overlap threshold.

Step S704, associating the first boundary data with the overlapped core data set.

In the embodiment of the present invention, the greater the distribution probability of data for a certain cluster, the more likely the data belongs to that cluster. If the distribution probability for another cluster differs only slightly from the maximum probability, the data also has a certain probability of belonging to that other cluster. The data may therefore be added to both clusters at the same time, and the most likely associated core data set may be further determined in subsequent processing.
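The overlap check of steps S702 and S704 can be sketched as follows. This is a minimal illustration under assumed names (`associated_core_sets`, `overlap_threshold`) and an assumed dictionary representation of the probabilities; it is not the claimed implementation.

```python
def associated_core_sets(probabilities, overlap_threshold):
    """Return (best, overlapped) for one boundary data point.

    best       : id of the core data set with the maximum probability.
    overlapped : ids of other core data sets whose probability is within
                 `overlap_threshold` of the maximum (steps S702/S704).
    """
    best = max(probabilities, key=probabilities.get)
    p_max = probabilities[best]
    overlapped = [
        core_id
        for core_id, p in probabilities.items()
        if core_id != best and p_max - p < overlap_threshold
    ]
    return best, overlapped


# "C2" is only 0.05 below the maximum, so the point is associated with both.
best, overlapped = associated_core_sets(
    {"C1": 0.45, "C2": 0.40, "C3": 0.15}, overlap_threshold=0.1
)
```

With a stricter threshold such as 0.01, the list of overlapped core data sets in this example would be empty and only the maximum-probability association would remain.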

To further demonstrate the clustering effect of the data clustering method provided by the present invention, the method is evaluated on specific sample data. The conventional DPCA clustering algorithm and the clustering algorithm provided by the present invention (3W-DPET algorithm for short) are each used to cluster 8 special-shaped data sets and 8 artificial data sets, and 3 clustering validity indexes are used to evaluate clustering performance: accuracy (ACC), adjusted Rand index (ARI), and normalized mutual information (NMI). These three indexes are conventional technical means for those skilled in the art and are not described again here. The basic information of the data sets used in the experiments is as follows:

table 1: data set information

For each data set, the performance indexes of clustering by adopting the clustering method disclosed by the invention and the conventional DPCA clustering method are as follows:

table 2: clustering performance comparison

The experimental data show that, on every data set except the Credit data set, the 3W-DPET clustering algorithm of the present invention improves on the conventional DPCA in accuracy, adjusted Rand index, and normalized mutual information.
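For readers unfamiliar with the indexes, the following sketch computes ARI and NMI from their standard textbook definitions using only the Python standard library. It is not taken from the embodiment; the function names are illustrative, and NMI is normalized here by the arithmetic mean of the two entropies, which is one of several common conventions. ACC is omitted because it additionally requires an optimal matching between predicted and true labels.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index of two labelings of the same items."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(true_labels, pred_labels):
    """NMI, normalized by the arithmetic mean of the two entropies."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    mutual_info = sum(
        (c / n) * log(c * n / (rows[a] * cols[b])) for (a, b), c in cells.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    denom = (h_true + h_pred) / 2
    return 1.0 if denom == 0 else mutual_info / denom


# Identical partitions with swapped label names score 1 on both indexes.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
nmi_same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
```

Both indexes depend only on how items are grouped, not on the label names, which is why they are suitable for comparing a clustering result with reference classes.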

Fig. 8 is a schematic structural diagram of a data clustering device according to an embodiment of the present invention, which is described in detail below.

In the embodiment of the present invention, the data clustering apparatus specifically includes a core data determining module 810, a boundary data processing module 820, and a data clustering module 830.

The core data determining module 810 is configured to respectively process the determined multiple pieces of clustering center data according to a preset core data determining rule, and generate multiple core data sets corresponding to the multiple pieces of clustering center data.

In the embodiment of the present invention, the plurality of cluster center data are determined by processing the data to be clustered through a density peak value clustering algorithm (DPCA algorithm), that is, the clustering algorithm provided by the present invention can be understood as an improved DPCA algorithm, and the specific content of the improvement lies in the processing after the cluster center data are determined.
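A minimal sketch of how cluster center data might be selected with a density peak approach is given below. The cutoff-count definition of local density, the index-based tie-breaking, and all names (`density_peak_centers`, `cutoff`, `density_threshold`, `distance_threshold`) are assumptions for illustration; the embodiment only requires that centers have a local density above a preset density threshold and a distance to the higher-density point above a preset distance threshold.

```python
from math import dist


def density_peak_centers(points, cutoff, density_threshold, distance_threshold):
    """Return (centers, rho, delta) for a list of coordinate tuples."""
    n = len(points)
    # Local density rho: number of other points closer than `cutoff` (assumed definition).
    rho = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < cutoff)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        # Points of higher density; equal densities are ordered by index (assumed tie-break).
        higher = [
            dist(points[i], points[j])
            for j in range(n)
            if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)
        ]
        if higher:
            delta.append(min(higher))
        else:
            # The densest point takes its largest distance to any point.
            delta.append(max(dist(points[i], p) for p in points))
    centers = [
        i for i in range(n)
        if rho[i] > density_threshold and delta[i] > distance_threshold
    ]
    return centers, rho, delta


# Two well-separated groups of five points each; indexes 4 and 9 are the middles.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
centers, rho, delta = density_peak_centers(
    points, cutoff=0.8, density_threshold=2, distance_threshold=5
)
```

In this toy input the middle point of each group is dense and far from any denser point, so both are selected as cluster center data.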

In the embodiment of the present invention, for any data in a core data set, all data whose distance to that data is smaller than the preset core distance threshold are also located in the core data set. Equivalently, any two data in the core data set are close enough to each other that the data in the core data set have a very high probability of being located in the same cluster, which is the effect that the preset core data determination rule of the present invention is intended to achieve. There are numerous specific ways of achieving this effect, that is, of generating a plurality of core data sets whose data have a very high probability of being located in the same cluster. Claim 1 of the present invention places no limitation on the specific implementation process, and any core data determination rule that can achieve the above effect is within the scope claimed by the present invention.
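One possible core data determination rule, consistent with the K-nearest-neighbor expansion recited in claim 4, can be sketched as follows. The function name `grow_core_set`, the breadth-first traversal order, and the point-index representation are assumptions for this sketch; other rules achieving the same effect are equally admissible.

```python
from collections import deque
from math import dist


def grow_core_set(points, center, k, core_distance):
    """Grow a core data set outward from the cluster center index `center`.

    A point joins the set when it is among the K nearest neighbors of a
    point already in the set and lies closer than `core_distance` to it.
    """
    core = {center}
    queue = deque([center])  # data still to be processed
    while queue:
        i = queue.popleft()
        # K nearest neighbors of the point currently being processed.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )[:k]
        for j in neighbours:
            if j not in core and dist(points[i], points[j]) < core_distance:
                core.add(j)
                queue.append(j)  # newly added data is processed in turn
    return core


# Points 0-2 form a chain with spacing 0.5; points 3-4 are too far away.
points = [(0, 0), (0.5, 0), (1.0, 0), (3, 0), (3.4, 0)]
core = grow_core_set(points, center=0, k=2, core_distance=0.6)
```

Point 2 is farther than the threshold from the center itself but is reached through point 1, which illustrates how the set is built by repeated processing rather than by a single radius around the center.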

The boundary data processing module 820 is configured to process the boundary data according to a preset boundary data processing rule, determine a core data set associated with each boundary data, and add each boundary data to the core data set corresponding to the boundary data.

In the embodiment of the present invention, data that is not added to any core data set by the core data determining module 810 does not definitely belong to a particular core data set. Such boundary data therefore needs to be further processed using corresponding processing rules before being added to the corresponding core data set. The present invention does not limit the specific boundary data processing rule; all rules and algorithms capable of establishing a correspondence between the boundary data and the core data sets are feasible. That is, the present invention classifies the boundary data on the basis of a cluster consisting of a plurality of data points, thereby solving the technical problem in the prior art that the classification of a remaining point depends on the cluster assigned to a single other point, which makes the clustering effect insufficiently stable.
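Because the boundary data processing rule is left open here, the following is only one hypothetical rule: an inverse-distance-weighted vote among the K nearest neighbors of a boundary data point. The weighting formula, the function name `membership_probabilities`, and the data layout are all assumptions for illustration and are not stated in this passage.

```python
from math import dist


def membership_probabilities(point, labelled_points, k):
    """Estimate the probability that `point` belongs to each core data set.

    `labelled_points` is a list of (coords, core_set_id) pairs, where
    core_set_id is None for neighbors not yet associated with any set.
    """
    nearest = sorted(labelled_points, key=lambda item: dist(point, item[0]))[:k]
    weights = {}
    for coords, core_id in nearest:
        if core_id is None:
            continue  # unassigned neighbors cast no vote
        # Closer neighbors weigh more; the small constant avoids division by zero.
        weight = 1.0 / (dist(point, coords) + 1e-12)
        weights[core_id] = weights.get(core_id, 0.0) + weight
    total = sum(weights.values())
    return {cid: w / total for cid, w in weights.items()} if total else {}


# The boundary point (1, 0) has two close neighbors in "A" and one farther in "B".
neighbours = [((0, 0), "A"), ((0.5, 0), "A"), ((3, 0), "B"), ((9, 9), "B")]
probs = membership_probabilities((1, 0), neighbours, k=3)
```

Since several neighbors from the same core data set contribute to the vote, a single wrongly assigned neighbor has limited influence on the result, which matches the stated motivation of relying on a set of points rather than on one point.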

The data clustering module 830 is configured to perform cluster division on the data to be clustered according to the core data sets.

In the embodiment of the present invention, the set of data added to a core data set by the core data determination module 810 is denoted as the positive domain POS(C_i) of the cluster C_i, and the set of data added to that core data set by the boundary data processing module 820 is understood as the boundary domain BND(C_i) of the cluster C_i. Although the data in the boundary domain also belong to the cluster, they are distributed in the edge region of the cluster and may also be members of other clusters. That is, the true extent of a cluster lies between the positive domain of the cluster and the union of its positive and boundary domains, i.e., C_i′ = [POS(C_i), POS(C_i) ∪ BND(C_i)].
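The interval representation C_i′ = [POS(C_i), POS(C_i) ∪ BND(C_i)] can be illustrated with a small data structure. The class name `ThreeWayCluster` and its member names are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeWayCluster:
    """A cluster described by its positive domain and its boundary domain."""
    positive: set = field(default_factory=set)  # POS(C_i): data from module 810
    boundary: set = field(default_factory=set)  # BND(C_i): data from module 820

    @property
    def lower(self):
        """Lower end of the interval: the positive domain alone."""
        return set(self.positive)

    @property
    def upper(self):
        """Upper end of the interval: POS(C_i) united with BND(C_i)."""
        return self.positive | self.boundary

    def surely_contains(self, item):
        return item in self.positive

    def may_contain(self, item):
        return item in self.upper


# Item 7 lies in the boundary domain, so it may also belong to another cluster.
cluster = ThreeWayCluster(positive={1, 2, 3}, boundary={7})
```

Keeping the two domains separate preserves the information that boundary data is only possibly a member, which a single flat set of members would lose.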

According to the data clustering device provided by the embodiment of the present invention, a plurality of clustering center data are respectively processed according to a preset core data determination rule to obtain a plurality of core data sets corresponding to the clustering center data, wherein the plurality of clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set includes first core data and all data whose distance to the first core data is smaller than a preset core distance threshold. After the core data sets are determined, the remaining data, namely the boundary data, are respectively divided into the core data sets to which they may belong. Each core data set can be understood as a classification cluster, so the data to be clustered can be clustered directly according to the core data sets. The data clustering device of the present invention first determines, through the preset core data determination rule, core data sets that each take clustering center data as the core and include a plurality of core data, each core data set satisfying the following: for any data in the core data set, all data whose distance to that data is smaller than the preset core distance threshold are also in the core data set. In other words, all data in each core data set are close enough to one another that they have a very high probability of belonging to the same cluster. The remaining boundary data are then divided into the clusters to which they may belong.

In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of:

It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least a portion of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for clustering data, comprising:

2. The data clustering method of claim 1, wherein the determining of the plurality of cluster center data comprises:

calculating the local density of each data to be clustered and the distance between each data to be clustered and the high-density point;

and determining all data to be clustered whose local density is greater than a preset first density threshold and whose distance to the high-density point is greater than a preset distance threshold as clustering center data.

3. The data clustering method according to claim 2, further comprising:

and determining all data to be clustered whose local density is smaller than a preset second density threshold and whose distance to the high-density point is greater than a preset distance threshold as noise data.

4. The data clustering method according to claim 1, wherein the step of processing the plurality of clustering center data according to the preset core data determination rule comprises:

processing first data added into a core data set associated with the clustering center data based on a core distance algorithm, and determining second data meeting a preset processing condition;

adding the second data into the core data set, and determining as first data added into the core data set associated with the cluster center data; wherein

When the first processing is carried out, first data added into a core data set associated with the clustering center data is clustering center data;

the step of processing the data to be processed based on the core distance algorithm specifically comprises the following steps:

determining core data from K nearest neighbor data of the data to be processed, wherein the core data are data with a distance smaller than a preset core distance threshold value from the data to be processed;

the second data meeting the preset processing condition refers to core data which is not in the core data set.

5. The data clustering method according to claim 1, wherein the step of processing the boundary data according to a preset boundary data processing rule and determining the core data set associated with each boundary data respectively specifically comprises:

determining K nearest neighbor data of the first boundary data;

respectively determining the probability that the first boundary data belongs to each core data set according to the associated core data sets of the K nearest neighbor data and the distance between the first boundary data and the associated core data sets of the K nearest neighbor data;

when the probability that any boundary data belongs to each core data set is judged to be not smaller than a preset distribution threshold, re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set, and returning to the step of respectively determining the probability that the first boundary data belongs to each core data set according to the associated core data sets of the K nearest neighbor data and the distance between the first boundary data and the associated core data sets of the K nearest neighbor data;

and when the probability that any boundary data belongs to each core data set is judged to be smaller than a preset distribution threshold, determining the associated core data set of each boundary data.

6. The data clustering method according to claim 5, wherein the step of re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set specifically comprises:

and associating the core data set corresponding to the maximum value of the probability that the first boundary data belongs to each core data set with the first boundary data.

7. The data clustering method of claim 6, further comprising:

and when judging that an overlapped core data set exists, associating the first boundary data with the overlapped core data set, wherein the overlapped core data set is a core data set which meets the condition that the difference value between the probability that the first boundary data belongs to the overlapped core data set and the maximum value of the probability is smaller than a preset overlap threshold except the core data set corresponding to the maximum value of the probability.

8. A data clustering apparatus, comprising:

9. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the data clustering method according to any one of claims 1 to 7.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the data clustering method according to any one of claims 1 to 7.