CN111310809A - Data clustering method and device, computer equipment and storage medium - Google Patents

Publication number
CN111310809A
Authority
CN
China
Prior art keywords
data
core
boundary
clustering
core data
Prior art date
Legal status
Pending
Application number
CN202010079585.6A
Other languages
Chinese (zh)
Inventor
于会
陈芦园
王星南
张洁
董文敏
杨海泽
Current Assignee
Chongqing Yichuang Northwest Industrial Technology Research Institute Co ltd
Original Assignee
Chongqing Yichuang Northwest Industrial Technology Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Yichuang Northwest Industrial Technology Research Institute Co ltd
Priority to CN202010079585.6A
Publication of CN111310809A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of computers and provides a data clustering method, a device, computer equipment and a storage medium. The data clustering method comprises the following steps: processing a plurality of clustering center data respectively to generate a plurality of corresponding core data sets; processing the remaining boundary data, determining the core data set associated with each boundary data, and adding each boundary data into its corresponding core data set; and clustering and dividing the data to be clustered according to the core data sets. The data clustering method provided by the invention first divides the data most likely to belong to the same cluster into a core data set, and then determines the relevance between the remaining data and that set. Compared with the prior art, which determines the relevance between the remaining data and individual data that have already been divided, even if a certain point in the set is wrongly divided, the division of the remaining boundary data is not seriously affected, which solves the technical problem that existing clustering algorithms depend on the accuracy of labels.

Description

Data clustering method and device, computer equipment and storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data clustering method, a data clustering device, computer equipment and a storage medium.
Background
Clustering is the process of dividing a set of physical or abstract objects into multiple clusters (or classes) composed of similar objects, i.e., classifying objects into different clusters, where objects in the same cluster are highly similar and objects in different clusters are highly dissimilar. There are many clustering algorithms at present, such as the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise), the DPCA algorithm (Density Peak-based Clustering Algorithm) and the KNN algorithm (k-nearest neighbor clustering algorithm), among which the DPCA algorithm is one of the more commonly used.
However, in the final stage of the DPCA algorithm, each remaining non-center point must be assigned to the cluster of its most similar point among those with a higher density than itself. This creates a relatively serious label dependence problem: during the assignment of non-center points, once a certain point is assigned to an incorrect cluster, points of lower density may be assigned to the same incorrect cluster, producing a chain reaction that degrades the clustering effect.
Therefore, the existing DPCA algorithm has the technical problems of high dependency on labels and unsatisfactory effect.
Disclosure of Invention
The embodiment of the invention aims to provide a data clustering method, and aims to solve the technical problems that the existing DPCA algorithm has high dependency on labels and an unsatisfactory effect.
The embodiment of the invention is realized in such a way that a data clustering method comprises the following steps:
according to a preset core data determination rule, respectively processing a plurality of clustering center data to generate a plurality of core data sets corresponding to the clustering center data; the plurality of clustering center data are determined by processing data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;
processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data; the boundary data refers to data to be clustered which are not contained in any core data set;
and clustering and dividing the data to be clustered according to the core data set.
Another object of an embodiment of the present invention is to provide a data clustering apparatus, including:
the core data determining module is used for respectively processing the determined plurality of clustering center data according to a preset core data determination rule to generate a plurality of core data sets corresponding to the plurality of clustering center data, wherein the plurality of clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;
the boundary data processing module is used for processing boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data, wherein the boundary data refers to data to be clustered which is not contained in any core data set;
and the data clustering module is used for clustering and dividing the data to be clustered according to the core data set.
It is a further object of an embodiment of the present invention to provide a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the data clustering method as described above.
It is a further object of an embodiment of the present invention to provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, causes the processor to perform the steps of the data clustering method as described above.
According to the data clustering method provided by the embodiment of the invention, a plurality of clustering center data are respectively processed according to a preset core data determination rule to obtain a plurality of core data sets corresponding to the clustering center data. The clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold. After the core data sets are determined, the remaining data, namely the boundary data, are respectively divided into the core data sets to which they may belong. Each core data set can be understood as a classification cluster, so the clustering of the data to be clustered can be realized directly from the core data sets. The data clustering method of the invention first determines, through a preset core data determination rule, core data sets that each take clustering center data as a core and comprise a plurality of core data, and each core data set meets the following requirement: for any data in the core data set, all data whose distance from that data is smaller than the preset core distance threshold are also in the core data set. In other words, all data in each core data set are close enough to one another and have a very high probability of belonging to the same cluster. The remaining boundary data are then divided into the clusters to which they may belong.
Drawings
FIG. 1 is a flowchart illustrating steps of a data clustering method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps for determining cluster center data according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another step of determining cluster center data according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating the steps of processing a plurality of cluster center data according to a preset core data determination rule according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the steps of processing boundary data according to a preset boundary data processing rule to respectively determine the core data sets associated with the boundary data according to an embodiment of the present invention;
FIG. 6 is another flowchart illustrating the steps of processing boundary data according to a preset boundary data processing rule to respectively determine the core data sets associated with the boundary data according to an embodiment of the present invention;
FIG. 7 is a further flowchart illustrating the steps of processing boundary data according to a preset boundary data processing rule to respectively determine the core data sets associated with the boundary data according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a data clustering device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
First, to facilitate understanding of the present invention, explanations of several technical terms, calculation formulas and the like are provided before the detailed discussion of each step, as follows:
① K-neighbor data:
For a given data p, its K-neighbor data $KNN_p$ is expressed as:
$$KNN_p = \{\, q \mid q \neq p,\ d_{pq} \le d_p^{\max} \,\}$$
where $d_{pq}$ is the Euclidean distance between the samples p and q (computing the Euclidean distance is a routine technical means for those skilled in the art and is not described here), and $d_p^{\max}$ represents the maximum distance between the sample p and its K nearest neighbors, K being a preset value. In short, the K-neighbor data of the data p is the set (excluding p) of the K data closest to the data p.
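As an illustrative sketch only, the K-neighbor definition above can be written in Python as follows. The function name `k_neighbors` and the sample points are assumptions introduced for illustration; ties in distance are broken by index, so exactly K data are returned.

```python
import math

def k_neighbors(points, p, k):
    """Return the indices of the k points closest to points[p], excluding p itself."""
    # Pair every other point q with its Euclidean distance d_pq to p.
    candidates = [(math.dist(points[p], points[q]), q)
                  for q in range(len(points)) if q != p]
    candidates.sort()  # ascending distance; equal distances fall back to the index
    return [q for _, q in candidates[:k]]

# Hypothetical sample data: four points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = k_neighbors(points, 0, 2)
print(neighbors)
```

For point 0 with K = 2, the two closest points are those at distances 1 and 2, so the far point (5, 5) is left out.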
② Local density:
For a given data p, its local density $\rho_p$ is expressed as:
$$\rho_p = \sum_{q \in KNN_p} \exp(-d_{pq})$$
where $d_{pq}$ and $KNN_p$ have the meanings given in ① and are not repeated here, and exp denotes the exponential function with the natural base e. In short, the local density of the data p is the sum, over the K-neighbor data of p, of exponentials whose exponent is the negative of the Euclidean distance between that neighbor and p.
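A minimal sketch of the local density follows, assuming the exponent is the negative Euclidean distance, so that closer neighbors contribute more. The function name `local_density` and the sample points are hypothetical.

```python
import math

def local_density(points, p, k):
    """Sum exp(-d_pq) over the k nearest neighbors q of points[p] (assumed form)."""
    # Distances from p to every other point, smallest first.
    distances = sorted(math.dist(points[p], points[q])
                       for q in range(len(points)) if q != p)
    # Only the k nearest neighbors contribute to the density.
    return sum(math.exp(-d) for d in distances[:k])

# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
rho_0 = local_density(points, 0, 2)
print(rho_0)
```

With K = 2, point 0 has neighbors at distances 1 and 2, so its density under this assumption is exp(-1) + exp(-2).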
③ Distance from high-density point:
For a given data p, its distance $\delta_p$ from the high-density point is expressed as:
$$\delta_p = \begin{cases} \min\limits_{q:\,\rho_p < \rho_q} d_{pq}, & \text{if some } q \text{ satisfies } \rho_p < \rho_q \\ \max\limits_{q} d_{pq}, & \text{otherwise} \end{cases}$$
where $q: \rho_p < \rho_q$ denotes all data q whose local density is higher than that of the data p; see ② for the definition of local density. In short, the distance between the data p and the high-density point is the distance from p to the closest of all points whose local density is higher than that of p. If the local density of p is the highest, the distance is instead defined as the distance between p and the data farthest from p.
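The two cases of this definition can be sketched as below. The function name `distance_to_higher_density` and the sample densities are assumptions made for illustration; the densities are supplied directly rather than computed.

```python
import math

def distance_to_higher_density(points, rho, p):
    """Distance from p to the nearest point of higher local density.

    If no point has a higher density than p, return the distance to the
    point farthest from p instead.
    """
    higher = [math.dist(points[p], points[q])
              for q in range(len(points)) if rho[q] > rho[p]]
    if higher:
        return min(higher)
    # p has the highest density: fall back to the farthest point.
    return max(math.dist(points[p], points[q])
               for q in range(len(points)) if q != p)

# Hypothetical sample data: three collinear points with given densities.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
rho = [0.9, 0.5, 0.2]
delta = [distance_to_higher_density(points, rho, p) for p in range(len(points))]
print(delta)
```

Point 0 has the highest density and therefore takes the distance to its farthest point, while points 1 and 2 take the distance to their nearest denser point.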
④ Middle distance:
For a given data p, the middle distance $d_p^{\mathrm{mid}}$ is expressed as:
$$d_p^{\mathrm{mid}} = \frac{d_p^{\max} + d_p^{\min}}{2}$$
where $d_p^{\max}$ represents the maximum distance between the sample p and its K nearest neighbors, and $d_p^{\min}$ represents the minimum distance between the sample p and its K nearest neighbors. In short, the middle distance of the data p is the average of the maximum and minimum distances between p and its K nearest neighbors.
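The middle distance can be sketched in a few lines. The function name `middle_distance` and the sample points are hypothetical and serve only to illustrate the averaging of the largest and smallest K-neighbor distances.

```python
import math

def middle_distance(points, p, k):
    """Average of the largest and smallest distance from p to its k nearest neighbors."""
    nearest = sorted(math.dist(points[p], points[q])
                     for q in range(len(points)) if q != p)[:k]
    # nearest[0] is the minimum distance, nearest[-1] the maximum.
    return (nearest[-1] + nearest[0]) / 2

# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
mid_0 = middle_distance(points, 0, 2)
print(mid_0)
```

For point 0 with K = 2, the neighbor distances are 1 and 2, giving a middle distance of 1.5.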
⑤ Support coefficient:
For any data p and one of its K-neighbor data $q_j \in KNN_p$, the support coefficient of $q_j$ for p,
Figure BDA0002379794710000061
is written as:
Figure BDA0002379794710000062
where
Figure BDA0002379794710000063
can be understood as the inverse of the distance between the two data, and
Figure BDA0002379794710000064
is the sum of the inverses of the distances between all K-neighbor data of the data $q_j$ and the data $q_i$.
⑥ Quality function:
For any data p and one of its K-neighbor data $q_j \in KNN_p$: if $q_j$ has been assigned and the label of its cluster is $C_i$, then $\chi(q_j) = 1$; if $q_j$ has not been assigned, then $\chi(q_j) = 0$. The quality function of $q_j$ for p,
Figure BDA0002379794710000065
is written as:
Figure BDA0002379794710000066
Here Θ denotes the set of all clusters, i.e., $\Theta = \{C_1, C_2, \ldots, C_k\}$. That is, for each sample data p, each of its K-neighbor data can construct a quality function for p, so p corresponds to K quality functions. Based on the theoretical fusion rule, these K quality functions can be fused into one, as described in detail below.
⑦ Fused quality function:
The K quality functions of the data p are fused to obtain the fused quality function $m_p$ of p. The specific fusion formula is:
Figure BDA0002379794710000067
where
Figure BDA0002379794710000071
⑧ Probability conversion:
For the fused quality function of the data p, taking i = 1, 2, ..., k in sequence gives the probability that the data p belongs to cluster $C_i$:
Figure BDA0002379794710000072
Figure BDA0002379794710000073
The distribution probabilities of the data p over the clusters then form:
Figure BDA0002379794710000074
Formulas ⑤ to ⑧ are mainly applied in the steps shown in FIG. 5.
As shown in fig. 1, a flowchart of steps of a data clustering method provided in an embodiment of the present invention specifically includes the following steps:
step S102, according to a preset core data determination rule, processing a plurality of clustering center data respectively to generate a plurality of core data sets corresponding to the clustering center data.
In the embodiment of the present invention, the plurality of cluster center data are determined by processing the data to be clustered through a density peak value clustering algorithm (DPCA algorithm), that is, the clustering algorithm provided by the present invention can be understood as an improved DPCA algorithm, and the specific content of the improvement lies in the processing after the cluster center data are determined. For a specific process of determining cluster center data, please refer to the following fig. 2 and the description thereof.
In an embodiment of the present invention, the core data set includes first core data and all data whose distance from the first core data is smaller than a preset core distance threshold. The first core data may be any data in the core data set, that is, for any data in the core data set, all data satisfying a distance smaller than a preset core distance threshold from the first core data are also located in the core data set.
In the embodiment of the present invention, because for any data in the core data set all data whose distance to that data is smaller than the preset core distance threshold are also located in the core data set, it can be equivalently understood that the data in a core data set are close enough to one another, that is, the data in the core data set have a very high probability of being located in the same cluster. This is the effect to be achieved by the preset core data determination rule of the present invention. There are numerous specific processes that achieve this effect, that is, that generate a plurality of core data sets whose data have a very high probability of being located in the same cluster. Claim 1 of the present invention places no limitation on the specific implementation process, and any core data determination rule that can achieve the above effect falls within the scope claimed by the present invention. In addition, the present invention provides a specific core data determination rule; for that rule, refer to claim 4 and its explanation.
As a preferred embodiment of the present invention, the core distance threshold is selected from a middle distance or a distance related to the middle distance, wherein for the definition of the middle distance of the data, please refer to the aforementioned point ④.
Step S104, processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into the core data set corresponding to the boundary data.
In the embodiment of the present invention, the boundary data refers to data to be clustered which is not included in any core data set.
In the embodiment of the present invention, the data not added to any core data set in step S102 cannot be said with certainty to belong to a particular core data set. Therefore, the boundary data need to be further processed with a corresponding processing rule and then added to the corresponding core data set. The invention does not limit the specific boundary data processing rule; all rules and algorithms capable of establishing the correspondence between boundary data and core data sets are feasible. That is, the invention classifies the boundary data by relying on a cluster consisting of a plurality of data points, which solves the technical problem in the prior art that the classification of a remaining non-center point depends on the cluster of a single other point, making the clustering effect insufficiently stable.
In the embodiment of the present invention, a specific boundary data processing rule is further provided, and refer to fig. 5 and the explanation thereof.
Step S106, performing cluster division on the data to be clustered according to the core data set.
In the embodiment of the present invention, the set of data added to a core data set in step S102 is denoted as the positive domain $POS(C_i)$ of the cluster $C_i$; these data can be understood as belonging to the cluster exactly or with a high probability. The set of data added to a core data set in step S104 is denoted as the boundary domain $BND(C_i)$ of the cluster $C_i$; although these data also belong to the cluster, they are distributed in the edge region of the cluster and may also be members of other clusters. That is, the true classification of a cluster lies between the positive domain of the cluster and the union of its positive and boundary domains, i.e., $C_i' = [POS(C_i),\ POS(C_i) \cup BND(C_i)]$.
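The interval representation above can be illustrated with plain Python sets. This is only a sketch: the function name `cluster_bounds` and the member labels are assumptions, and the two input sets stand for the data added in steps S102 and S104 respectively.

```python
def cluster_bounds(positive_domain, boundary_domain):
    """Return the (lower, upper) bounds of a cluster as sets.

    lower = POS(C_i): data that belong to the cluster with high probability.
    upper = POS(C_i) union BND(C_i): additionally includes the edge data.
    """
    lower = set(positive_domain)
    upper = lower | set(boundary_domain)
    return lower, upper

# Hypothetical cluster: A, B, C were added in step S102, E in step S104.
lower, upper = cluster_bounds({"A", "B", "C"}, {"E"})
print(sorted(lower), sorted(upper))
```

Any true cluster consistent with this representation contains every member of the lower bound and nothing outside the upper bound.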
As shown in fig. 2, a flowchart of the step of determining cluster center data provided in the embodiment of the present invention specifically includes the following steps:
step S202, calculating the local density of each data to be clustered and the distance between each data to be clustered and a high-density point.
In the embodiment of the present invention, for the explanation of the local density and the distance between the local density and the high density point and the specific calculation formula, please refer to the explanation in items ② and ③, which are not repeated herein.
Step S204, determining all data to be clustered, of which the local density is greater than a preset first density threshold and the distance between the data to be clustered and the high-density point is greater than a preset distance threshold, as clustering center data.
In the embodiment of the present invention, the definition of the cluster center data in the DPCA algorithm is that the data point has both a higher local density and a larger distance from the high-density point, and therefore, by presetting a density threshold and a distance threshold, the data point whose local density is higher than the density threshold and whose distance from the high-density point is higher than the distance threshold is directly determined as the cluster center data.
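Step S204 amounts to a two-threshold filter, which can be sketched as follows. The function name `select_centers`, the threshold values and the sample densities and distances are all assumptions chosen for illustration.

```python
def select_centers(rho, delta, density_threshold, distance_threshold):
    """Indices whose local density and distance from the high-density point
    both exceed their respective preset thresholds."""
    return [i for i in range(len(rho))
            if rho[i] > density_threshold and delta[i] > distance_threshold]

# Hypothetical local densities and distances for four data.
rho = [0.9, 0.5, 0.8, 0.1]
delta = [4.0, 1.0, 3.5, 5.0]
centers = select_centers(rho, delta, density_threshold=0.6, distance_threshold=2.0)
print(centers)
```

Data 1 is dense but too close to a denser point, and data 3 is far away but sparse, so only data 0 and 2 qualify as cluster centers here.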
As shown in FIG. 3, another flowchart of the steps of determining cluster center data is provided in the embodiment of the present invention. Compared with the flowchart of the steps of determining cluster center data shown in FIG. 2, it further includes:
step S302, determining all data to be clustered, of which the local density is smaller than a preset second density threshold and the distance between the local density and the high density point is greater than a preset distance threshold, as noise data.
In the embodiment of the invention, considering the situation that the clustering data may have wrong data, namely noise points, caused by statistical errors or other factors, if the data are not removed during clustering, the clustering effect is influenced. Since the noise points are usually isolated and located at the periphery, that is, have a smaller local density and a larger distance from the high-density points, all the data to be clustered, which have a local density smaller than a preset second density threshold and a distance from the high-density points larger than a preset distance threshold, are determined as the noise data. And in the subsequent data processing process, noise data can be removed from the data set to be clustered so as to improve the clustering effect.
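A minimal sketch of the noise filter of step S302 and the subsequent removal follows. The function name `find_noise`, the thresholds and the sample values are hypothetical.

```python
def find_noise(rho, delta, second_density_threshold, distance_threshold):
    """Indices with a low local density and a large distance from the high-density point."""
    return [i for i in range(len(rho))
            if rho[i] < second_density_threshold and delta[i] > distance_threshold]

# Hypothetical local densities and distances for four data.
rho = [0.9, 0.5, 0.8, 0.1]
delta = [4.0, 1.0, 3.5, 5.0]
noise = find_noise(rho, delta, second_density_threshold=0.2, distance_threshold=2.0)

# Remove the noise data before any further clustering step.
remaining = [i for i in range(len(rho)) if i not in noise]
print(noise, remaining)
```

Only data 3 is both sparse and isolated, so it is the single item removed from the data set to be clustered in this example.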
As shown in fig. 4, a flowchart of the step of processing the plurality of cluster center data according to the preset core data determination rule according to the embodiment of the present invention specifically includes the following steps:
step S402, processing the first data added into the core data set associated with the cluster center data based on a core distance algorithm, and determining second data meeting a preset processing condition.
In the embodiment of the present invention, in the first processing, the first data added to the core data set associated with the cluster center data is cluster center data.
In the embodiment of the present invention, the step of processing the data to be processed based on the core distance algorithm specifically includes: determining core data from the K nearest neighbor data of the data to be processed, wherein the core data refers to data whose distance from the data to be processed is smaller than a preset core distance threshold. Here, the first data is simply to be understood as the data to be processed.
In this embodiment of the present invention, the second data meeting the preset processing condition refers to core data that is not in the core data set.
Step S404, adding the second data into the core data set, and determining the second data as the first data added into the core data set associated with the cluster center data.
In this embodiment of the present invention, after the second data is added to the core data set and re-determined as the first data, the process returns to step S402. That is, steps S402 and S404 can be understood as a loop, which ends when, after some iteration, no second data meeting the preset processing condition remains. At this time, the current core data set satisfies the condition that it includes the first core data and all data whose distance from the first core data is smaller than the preset core distance threshold.
In the embodiment of the present invention, in order to facilitate understanding of the process of processing the plurality of cluster center data respectively provided by the present invention, a simple example will be explained below.
For cluster center data point A and other data to be clustered B, C, D, E, F, G, etc., the cluster center data point A is first added to the core data set associated with point A and determined as first data. Based on the core distance algorithm (see the description above for the calculation process), the core data of A is determined to include B and C. Since B and C are not yet in the core data set associated with point A, B and C are added to that set and determined as first data. The core distance algorithm is then applied to B and C respectively: the core data of B includes A and C, and the core data of C includes A and D. At this point A and C are already in the core data set associated with point A, so only D is second data meeting the preset processing condition; D is added to the core data set associated with point A and processed in turn. When the core data of D is finally determined to include B and C, no second data meeting the preset processing condition remains and the loop ends. At this time, the core data set associated with point A includes A, B, C and D.
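The loop of steps S402 and S404 can be sketched as a breadth-first expansion, using the example of points A to D. This is an illustrative sketch only: the function name `grow_core_set` is hypothetical, and the mapping `core_neighbors` is assumed to have been precomputed by the core distance algorithm (for each data, its K nearest neighbors within the core distance threshold).

```python
from collections import deque

def grow_core_set(center, core_neighbors):
    """Expand a core data set outward from a cluster center.

    core_neighbors maps each data to its core data, i.e. the K nearest
    neighbors closer than the core distance threshold (assumed precomputed).
    """
    core_set = {center}
    to_process = deque([center])      # "first data" still to be processed
    while to_process:
        first = to_process.popleft()
        for second in core_neighbors.get(first, ()):
            if second not in core_set:    # "second data": core data not yet in the set
                core_set.add(second)
                to_process.append(second)  # becomes first data in a later pass
    return core_set

# Core data of each point, taken from the worked example.
core_neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
core_set = grow_core_set("A", core_neighbors)
print(sorted(core_set))
```

The loop stops once processing D yields no data outside the set, leaving A, B, C and D in the core data set of A; E, F and G are never reached and remain boundary data.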
As shown in fig. 5, a flowchart of the steps of processing boundary data according to a preset boundary data processing rule and determining a core data set associated with each boundary data respectively according to an embodiment of the present invention specifically includes the following steps:
in step S502, K nearest neighbor data of the first boundary data are determined.
In the embodiment of the present invention, every boundary data needs to be processed at the same time, that is, the K nearest neighbor data of each boundary data are determined. For simplicity of description, each boundary data is referred to below as the first boundary data.
Step S504, determining the probabilities that the first boundary data belongs to each core data set according to the associated core data sets of the K nearest neighbor data and the distances between the K nearest neighbor data and the first boundary data.
In the embodiment of the present invention, the specific calculation process sequentially includes:
determining support coefficients of the K nearest neighbor data of the first boundary data to the first boundary data according to the associated core data set of the K nearest neighbor data and the distance between the K nearest neighbor data and the first boundary data, as explained in the aforementioned item ⑤;
determining a quality function of the K nearest neighbor data of the first boundary data to the first boundary data according to the support coefficients of the K nearest neighbor data of the first boundary data to the first boundary data and the core data set of the K nearest neighbor data, which is explained in the aforementioned item ⑥;
fusing the K nearest neighbor data of the first boundary data with the quality function of the first boundary data, as explained in the aforementioned item ⑦;
the probability distribution of the first boundary data for each cluster (i.e., core data set) is determined according to the fused quality function, as explained in the aforementioned item ⑧.
Step S506, determining whether the probability that any boundary data belongs to each core data set is not less than a preset allocation threshold. If yes, go to step S508; if not, go to step S510.
In the embodiment of the present invention, a probability distribution may be obtained for each boundary data, and a matrix may be formed by combining the probability distributions obtained for all boundary data. When all values in the matrix are smaller than the preset allocation threshold, each boundary data has already been divided into the cluster to which it most probably belongs, and the loop ends. When some value in the matrix is not smaller than the preset allocation threshold, some boundary data may not yet be divided into its most probable cluster and further adjustment is needed; because each boundary data is affected by its K nearest neighbor data, all boundary data need to be adjusted again.
Step S508, re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set, and returning to the step S504.
In the embodiment of the present invention, the associated core data set of the first boundary data is re-determined according to the probability that each data belongs to each core data set, and then the process returns to step S504 to perform a loop.
Step S510, determining an associated core data set of each boundary data.
In the embodiment of the present invention, the determined associated core data set of each piece of boundary data is the core data set to which each piece of boundary data most likely belongs.
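The loop of steps S502 to S510 can be illustrated with a deliberately simplified sketch. Two substitutions are assumptions and not the patent's method: the probability is a normalized distance-weighted vote of the K nearest neighbors (standing in for the support coefficients, quality functions, and fusion of items ⑤ to ⑧), and the loop ends when no assignment changes (standing in for the allocation-threshold test of step S506). The name `assign_boundary` and the sample data are hypothetical.

```python
from math import dist


def assign_boundary(points, labels, boundary, k, max_iter=20):
    """labels maps index -> core data set id; boundary lists the unassigned indices."""
    labels = dict(labels)
    for _ in range(max_iter):
        proposed = {}
        for b in boundary:
            # Step S502: K nearest neighbor data of the boundary data.
            neighbors = sorted((j for j in range(len(points)) if j != b),
                               key=lambda j: dist(points[b], points[j]))[:k]
            # Step S504 (simplified): closer neighbors give more support to their set.
            weights = {}
            for j in neighbors:
                if j in labels:  # neighbors without a set yet give no support
                    w = 1.0 / (1.0 + dist(points[b], points[j]))
                    weights[labels[j]] = weights.get(labels[j], 0.0) + w
            if weights:
                total = sum(weights.values())
                probs = {c: w / total for c, w in weights.items()}
                proposed[b] = max(probs, key=probs.get)
        # Step S508 (simplified): all boundary data are re-assigned together.
        changed = any(labels.get(b) != c for b, c in proposed.items())
        labels.update(proposed)
        if not changed:  # stand-in stopping test, see the assumptions above
            break
    return labels


# Two core data sets (ids 0 and 1) and two boundary points, illustrative only.
points = [(0, 0), (1, 0), (9, 0), (10, 0), (2, 0), (8, 0)]
result = assign_boundary(points, {0: 0, 1: 0, 2: 1, 3: 1}, boundary=[4, 5], k=2)
print(result[4], result[5])
```

Updating all boundary data from the same snapshot of assignments reflects the remark that every boundary data is processed at the same time; `max_iter` guards against oscillation.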
As shown in fig. 6, an embodiment of the present invention provides another flowchart of the steps of processing boundary data according to a preset boundary data processing rule and respectively determining a core data set associated with each boundary data, which differs from the flowchart shown in fig. 5 in that:
in the step S508, the associated core data set of the first boundary data is re-determined according to the probability that the first boundary data belongs to each core data set, and the step S504 is returned to, specifically:
step S602, associating the core data set corresponding to the maximum value of the probability that the first boundary data belongs to each core data set with the first boundary data.
In the embodiment of the present invention, since the distribution probability of the data reflects the probability that the data belongs to the corresponding cluster, the greater the distribution probability of the data for a certain cluster, the more likely the data belongs to that cluster. Therefore, the boundary data may be divided into the corresponding core data set, that is, the core data set corresponding to the maximum value of the probability that the first boundary data belongs to each core data set is associated with the first boundary data.
As shown in fig. 7, a flowchart of still another step for processing boundary data according to a preset boundary data processing rule and determining a core data set associated with each boundary data respectively provided in the embodiment of the present invention is different from the flowchart of the step shown in fig. 6 in that:
in step S508, after re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set, the method further includes:
step S702 determines whether an overlapped core data set exists. When the judgment is yes, step S704 is performed; and when the judgment is no, executing other steps.
In this embodiment of the present invention, the overlapped core data set is a core data set, other than the core data set corresponding to the maximum value of the probability, for which the difference between the maximum value of the probability and the probability that the first boundary data belongs to that core data set is smaller than a preset overlap threshold.
The step S704 is to associate the first boundary data with the overlapped core data set.
In the embodiment of the present invention, considering that the greater the distribution probability of data for a certain cluster, the more likely the data belongs to the cluster, and if there is a smaller difference between the distribution probability of another cluster and the maximum probability, the data also has a certain probability to belong to the cluster, the data may be added to the cluster at the same time, and the most likely associated core data set may be further determined in the subsequent processing process.
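The selection in steps S602, S702, and S704 can be sketched as follows. The function name `associated_sets`, the dictionary representation of the probabilities, and the numeric values are illustrative assumptions.

```python
def associated_sets(probs, overlap_threshold):
    """probs maps core data set id -> probability for one boundary data."""
    # Step S602: the core data set with the maximum probability.
    best = max(probs, key=probs.get)
    p_max = probs[best]
    # Steps S702/S704: other sets whose probability is within the overlap
    # threshold of the maximum are treated as overlapped core data sets.
    overlapped = [c for c, p in probs.items()
                  if c != best and p_max - p < overlap_threshold]
    return best, overlapped


best, overlapped = associated_sets({"C1": 0.48, "C2": 0.45, "C3": 0.07},
                                   overlap_threshold=0.05)
print(best, overlapped)  # C1 ['C2']
```

With these values the boundary data is associated with C1 and, because 0.48 - 0.45 is below the overlap threshold, also with C2.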
In order to further show the clustering effect of the data clustering method provided by the invention, the clustering effect is evaluated on specific sample data. The conventional DPCA clustering algorithm and the clustering algorithm provided by the invention (3W-DPET algorithm for short) are each used to cluster 8 special-shaped data sets and 8 artificial data sets, and 3 clustering validity indexes are used to evaluate the clustering performance: accuracy (ACC), adjusted Rand index (ARI), and normalized mutual information (NMI). These three clustering validity indexes belong to the conventional technical means of those skilled in the art and are not described herein again. The basic information of the data sets used in the experimental procedure is as follows:
Table 1: Data set information [table provided as images in the original publication]
For each data set, the performance indexes of clustering by adopting the clustering method disclosed by the invention and the conventional DPCA clustering method are as follows:
Table 2: Clustering performance comparison [table provided as an image in the original publication]
Taken together, the experimental data show that, except on the Credit data set, the 3W-DPET clustering algorithm of the invention achieves a certain performance improvement over the conventional DPCA in accuracy, adjusted Rand index, and normalized mutual information.
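As a reminder of how one of these conventional indexes is computed, the sketch below implements the standard pair-counting definition of the adjusted Rand index. It is not taken from the patent; the function name and sample labels are illustrative, and at least two samples are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index from the contingency table of two labelings (n >= 2)."""
    n = len(true_labels)
    # Pairs of samples placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs placed together in each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                     # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# The index ignores label names: a renamed but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1 means the two partitions agree exactly, values near 0 mean agreement at chance level, and negative values mean worse than chance.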
Fig. 8 is a schematic structural diagram of a data clustering device according to an embodiment of the present invention, which is described in detail below.
In the embodiment of the present invention, the data clustering apparatus specifically includes a core data determining module 810, a boundary data processing module 820, and a data clustering module 830.
The core data determining module 810 is configured to respectively process the determined multiple pieces of clustering center data according to a preset core data determining rule, and generate multiple core data sets corresponding to the multiple pieces of clustering center data.
In the embodiment of the present invention, the plurality of cluster center data are determined by processing the data to be clustered through a density peak value clustering algorithm (DPCA algorithm), that is, the clustering algorithm provided by the present invention can be understood as an improved DPCA algorithm, and the specific content of the improvement lies in the processing after the cluster center data are determined.
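For orientation, the selection of clustering center data by density peaks can be sketched as below. The cutoff-based local density, the Euclidean distance, the function name `density_peaks`, and all parameter names and sample values are assumptions chosen for illustration; the patent text here only states that local density and the distance to a high-density point are compared with preset thresholds.

```python
from math import dist


def density_peaks(points, cutoff, rho_min, delta_min):
    """Return local densities, distances to higher-density points, and center indices."""
    n = len(points)
    # Local density: number of other points within the cutoff distance (assumed kernel).
    rho = [sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))   # distance to the nearest denser point
        else:
            # Densest point: use its largest distance to any other point.
            delta.append(max(dist(points[i], points[j]) for j in range(n)))
    # Centers: high local density and far from any denser point.
    centers = [i for i in range(n) if rho[i] > rho_min and delta[i] > delta_min]
    return rho, delta, centers


# Two small groups; indices 0 and 5 sit in the middle of their groups.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
          (10, 10), (11, 10), (10, 11), (9, 10)]
rho, delta, centers = density_peaks(points, cutoff=1.2, rho_min=2, delta_min=5)
print(centers)
```

The resulting center indices would then serve as the starting points from which the core data sets are generated.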
In an embodiment of the present invention, the core data set includes first core data and all data whose distance from the first core data is smaller than a preset core distance threshold. The first core data may be any data in the core data set, that is, for any data in the core data set, all data satisfying a distance smaller than a preset core distance threshold from the first core data are also located in the core data set.
In the embodiment of the present invention, because every data whose distance to any data in the core data set is smaller than the preset core distance threshold is itself located in the core data set, it can be understood that the data in the core data set are sufficiently close to one another, that is, the data in the core data set have a very high probability of being located in the same cluster. This is the effect to be achieved by the preset core data determination rule of the present invention. There are numerous ways to achieve this effect, that is, to generate a plurality of core data sets whose data have a very high probability of being located in the same cluster. Claim 1 of the present invention therefore places no limitation on the specific implementation process, and any core data determination rule that can achieve the above effect is within the scope claimed by the present invention.
As a preferred embodiment of the present invention, the core distance threshold is selected from a middle distance or a distance related to the middle distance, wherein for the definition of the middle distance of the data, please refer to the aforementioned point ④.
The boundary data processing module 820 is configured to process the boundary data according to a preset boundary data processing rule, determine a core data set associated with each boundary data, and add each boundary data to the core data set corresponding to the boundary data.
In the embodiment of the present invention, the boundary data refers to data to be clustered which is not included in any core data set.
In the embodiment of the present invention, data that is not added to a core data set in the core data determining module 810 does not definitely belong to any particular core data set. Such boundary data therefore needs to be further processed using corresponding processing rules and then added to the corresponding core data set. The invention does not limit the specific boundary data processing rule; all rules and algorithms capable of establishing the correspondence between the boundary data and the core data sets are feasible. In other words, the invention classifies the boundary data based on a cluster consisting of a plurality of data points, thereby solving the technical problem in the prior art that the classification of each remaining non-center point depends on the cluster of a single other point, which makes the clustering effect insufficiently stable.
The data clustering module 830 performs cluster division on the data to be clustered according to the core data set.
In the embodiment of the present invention, the set of data added to a core data set in the core data determining module 810 is denoted as the positive domain $\mathrm{POS}(C_i)$ of the cluster $C_i$, and the set of data added to that core data set in the boundary data processing module 820 is understood as the boundary domain $\mathrm{BND}(C_i)$ of the cluster $C_i$. Although the data in the boundary domain also belong to the cluster, they are distributed in the edge region of the cluster and may also be members of other clusters. That is, the true classification of a cluster lies between the positive domain of the cluster and the union of the positive and boundary domains of the cluster, i.e., $C_i' = [\mathrm{POS}(C_i),\ \mathrm{POS}(C_i) \cup \mathrm{BND}(C_i)]$.
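The interval representation of a cluster can be sketched with plain sets. The function name `three_way_cluster`, the dictionary layout of the boundary assignments, and the sample values are illustrative assumptions.

```python
def three_way_cluster(core_set, boundary_assignments, cluster_id):
    """Return (POS, POS ∪ BND) for one cluster.

    boundary_assignments maps each boundary data index to the set of
    cluster ids it has been associated with (possibly more than one).
    """
    pos = set(core_set)                                  # positive domain
    bnd = {i for i, ids in boundary_assignments.items()  # boundary domain
           if cluster_id in ids}
    return pos, pos | bnd


# Data 4 lies in the boundary domain of both C1 and C2.
assignments = {3: {"C1"}, 4: {"C1", "C2"}, 5: {"C2"}}
lower, upper = three_way_cluster({0, 1, 2}, assignments, "C1")
print(sorted(lower), sorted(upper))
```

The pair plays the role of the two ends of the interval: data in the first set certainly belongs to the cluster, while data only in the second set may also belong to another cluster.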
According to the data clustering device provided by the embodiment of the invention, a plurality of clustering center data are respectively processed according to a preset core data determination rule to obtain a plurality of core data sets corresponding to the clustering center data, wherein the plurality of clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold. After the core data sets are determined, the remaining data, namely the boundary data, are respectively divided into the core data sets to which they may belong. Each core data set can be understood as a cluster, so the clustering of the data to be clustered can be realized directly according to the core data sets. The data clustering device of the invention first determines, through the preset core data determination rule, core data sets that each take clustering center data as a core and comprise a plurality of core data, and each core data set meets the following requirement: for any data in the core data set, all data whose distance from that data is smaller than the preset core distance threshold are also in the core data set. That is, all data in each core data set are close enough to have a very high probability of belonging to the same cluster. The remaining boundary data are then divided into the clusters to which they may belong.
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
according to a preset core data determination rule, respectively processing a plurality of clustering center data to generate a plurality of core data sets corresponding to the clustering center data; the plurality of clustering center data are determined by processing data to be clustered through a density peak clustering algorithm, and the core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;
processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data; the boundary data refers to data to be clustered which are not contained in any core data set;
and clustering and dividing the data to be clustered according to the core data set.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of:
according to a preset core data determination rule, respectively processing a plurality of clustering center data to generate a plurality of core data sets corresponding to the clustering center data; the plurality of clustering center data are determined by processing data to be clustered through a density peak clustering algorithm, and the core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;
processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data; the boundary data refers to data to be clustered which are not contained in any core data set;
and clustering and dividing the data to be clustered according to the core data set.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in various embodiments may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for clustering data, comprising:
according to a preset core data determination rule, respectively processing a plurality of clustering center data to generate a plurality of core data sets corresponding to the clustering center data; the plurality of clustering center data are determined by processing data to be clustered through a density peak clustering algorithm, and the core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;
processing the boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data; the boundary data refers to data to be clustered which are not contained in any core data set;
and clustering and dividing the data to be clustered according to the core data set.
2. The data clustering method of claim 1, wherein the determining of the plurality of cluster center data comprises:
calculating the local density of each data to be clustered and the distance between each data to be clustered and the high-density point;
and determining all data to be clustered whose local density is greater than a preset first density threshold and whose distance from the high-density point is greater than a preset distance threshold as clustering center data.
3. The data clustering method according to claim 2, further comprising:
and determining all data to be clustered whose local density is smaller than a preset second density threshold and whose distance from the high-density point is greater than a preset distance threshold as noise data.
4. The data clustering method according to claim 1, wherein the step of processing the plurality of clustering center data according to the preset core data determination rule comprises:
processing first data added into a core data set associated with the clustering center data based on a core distance algorithm, and determining second data meeting a preset processing condition;
adding the second data into the core data set, and determining as first data added into the core data set associated with the cluster center data; wherein
when the processing is carried out for the first time, the first data added into the core data set associated with the clustering center data is the clustering center data;
the step of processing the data to be processed based on the core distance algorithm specifically comprises the following steps:
determining core data from K nearest neighbor data of the data to be processed, wherein the core data are data with a distance smaller than a preset core distance threshold value from the data to be processed;
the second data meeting the preset processing condition refers to core data which is not in the core data set.
5. The data clustering method according to claim 1, wherein the step of processing the boundary data according to a preset boundary data processing rule and determining the core data set associated with each boundary data respectively specifically comprises:
determining K nearest neighbor data of the first boundary data;
respectively determining the probability that the first boundary data belongs to each core data set according to the associated core data sets of the K nearest neighbor data and the distances between the K nearest neighbor data and the first boundary data;
when the probability that any boundary data belongs to each core data set is judged to be not smaller than a preset allocation threshold, re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set, and returning to the step of respectively determining the probability that the first boundary data belongs to each core data set according to the associated core data sets of the K nearest neighbor data and the distances between the K nearest neighbor data and the first boundary data;
and when the probability that any boundary data belongs to each core data set is judged to be smaller than a preset distribution threshold, determining the associated core data set of each boundary data.
6. The data clustering method according to claim 5, wherein the step of re-determining the associated core data set of the first boundary data according to the probability that the first boundary data belongs to each core data set specifically comprises:
and associating the core data set corresponding to the maximum value of the probability that the first boundary data belongs to each core data set with the first boundary data.
7. The data clustering method of claim 6, further comprising:
and when judging that an overlapped core data set exists, associating the first boundary data with the overlapped core data set, wherein the overlapped core data set is a core data set which meets the condition that the difference value between the probability that the first boundary data belongs to the overlapped core data set and the maximum value of the probability is smaller than a preset overlap threshold except the core data set corresponding to the maximum value of the probability.
8. A data clustering apparatus, comprising:
the core data determining module is used for respectively processing the determined plurality of clustering center data according to a preset core data determination rule to generate a plurality of core data sets corresponding to the plurality of clustering center data, wherein the plurality of clustering center data are determined by processing the data to be clustered through a density peak clustering algorithm, and each core data set comprises first core data and all data whose distance from the first core data is smaller than a preset core distance threshold;
the boundary data processing module is used for processing boundary data according to a preset boundary data processing rule, respectively determining a core data set associated with each boundary data, and respectively adding each boundary data into a core data set corresponding to the boundary data, wherein the boundary data refers to data to be clustered which is not contained in any core data set;
and the data clustering module is used for clustering and dividing the data to be clustered according to the core data set.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the data clustering method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the data clustering method according to any one of claims 1 to 7.
CN202010079585.6A 2020-02-04 2020-02-04 Data clustering method and device, computer equipment and storage medium Pending CN111310809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010079585.6A CN111310809A (en) 2020-02-04 2020-02-04 Data clustering method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010079585.6A CN111310809A (en) 2020-02-04 2020-02-04 Data clustering method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111310809A true CN111310809A (en) 2020-06-19

Family

ID=71150893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010079585.6A Pending CN111310809A (en) 2020-02-04 2020-02-04 Data clustering method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111310809A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022063150A1 (en) * 2020-09-27 2022-03-31 阿里云计算有限公司 Data storage method and device, and data query method and device
CN113836373A (en) * 2021-01-20 2021-12-24 国义招标股份有限公司 Bidding information processing method and device based on density clustering and storage medium
CN113469242A (en) * 2021-06-29 2021-10-01 深圳市瑞立视多媒体科技有限公司 Multithreading-based clustering data processing method and data processing equipment
CN116910458A (en) * 2023-09-14 2023-10-20 自然资源部第一海洋研究所 Satellite-borne laser radar data denoising method based on improved density peak value clustering
CN116910458B (en) * 2023-09-14 2024-01-09 自然资源部第一海洋研究所 Satellite-borne laser radar data denoising method based on improved density peak value clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200619
