CN112132217B - Categorical data clustering method based on intra-cluster and inter-cluster dissimilarity - Google Patents
Categorical data clustering method based on intra-cluster and inter-cluster dissimilarity
- Publication number
- CN112132217B · CN202011009696.6A · CN202011009696A
- Authority
- CN
- China
- Prior art keywords
- cluster
- data
- dissimilarity
- data object
- clusters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity, which provides a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity and completes automatic selection of cluster centers based on this dissimilarity. The dissimilarity of the invention preserves the characteristics of the data, achieves the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity, improves clustering accuracy, purity and recall, and effectively improves the clustering effect on categorical data; it can prevent the loss of important feature values during clustering, strengthen the similarity between feature values within a cluster, and weaken the similarity between feature values across clusters. Through the automatic cluster center selection method, the error caused by clustering with randomly or manually selected cluster centers is greatly reduced.
Description
Technical Field
The invention relates to the technical field of data clustering, and in particular to a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity.
Background
A clustering algorithm is a machine learning algorithm for grouping data: given a dataset, a clustering algorithm divides it into several groups such that, in theory, data in the same group share the same attributes or characteristics, while the attributes or characteristics of data in different groups differ more strongly. Clustering is an unsupervised learning method and, as a commonly used data analysis tool, is applied in many fields.
In data science, cluster analysis is carried out with clustering algorithms, and grouping the data makes the information it contains easier to grasp. As an important component of data mining, categorical data clustering can help an analyst summarize the disease characteristics of each type of patient from the database of a rehabilitation therapy scheme recommendation system, allowing the analyst to focus on a specific patient group for further analysis.
The classical k-means algorithm uses the Euclidean distance when calculating cluster means and the dissimilarity between data objects; it is only applicable to numerical datasets with continuous features and not to categorical datasets with discrete features. In 1998, Huang extended the k-means algorithm by using "modes" instead of "means" and proposed the k-modes algorithm for categorical data clustering. The k-modes algorithm uses the simple Hamming distance to calculate dissimilarity, which ignores differences between data objects that share the same categorical feature value, weakens intra-cluster similarity, and does not fully reflect the dissimilarity between two feature values of the same categorical feature, thereby affecting the accuracy of the clustering result. In addition, the k-modes algorithm determines the initial cluster centers and the value of k by random selection and recalculates and updates cluster centers with a frequency-based method, which brings great uncertainty to the clustering result.
Disclosure of Invention
The invention aims to solve the accuracy problem of the k-modes algorithm and the problem of initial cluster center selection, and provides a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity.
In order to solve the problems, the invention is realized by the following technical scheme:
a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity comprises the following steps:
Step 1: for a categorical dataset D with n data objects, calculate the dissimilarity d_{i,j} between every two data objects using the simple Hamming distance;
Step 2: for each data object x_i of the categorical dataset D, first sort the dissimilarities d_{i,j} between x_i and the other data objects in ascending order to obtain the dissimilarity vector d'_{i,j} = [d'_{i,1}, d'_{i,2}, ..., d'_{i,n}] of x_i; then take the maximum difference between two adjacent dissimilarities in the dissimilarity vector d'_{i,j} as the cut-off distance d_{c,i} of data object x_i;
Step 3: select the minimum of the cut-off distances d_{c,i} of all data objects in the categorical dataset D as the cut-off distance d_c of the categorical dataset D;
Step 4: based on the cut-off distance d_c of the categorical dataset D, calculate the local neighborhood density ρ_i of each data object x_i of the categorical dataset D using the square wave kernel function method or the Gaussian kernel function method;
Step 5: calculate the relative distance L_i of each data object x_i of the categorical dataset D;
Step 6: for each data object x_i of the categorical dataset D, use its local neighborhood density ρ_i and relative distance L_i to obtain its decision value Z_i:
Z_i = ρ_i × L_i
Step 7: first sort the decision values Z_i of all data objects in the categorical dataset D in descending order to obtain a sorted sequence; based on the sorted sequence, draw the decision graph of the categorical dataset D with the subscript i of data object x_i as the abscissa and the decision value Z_i of data object x_i as the ordinate; the abscissa at the inflection point of this decision graph is the selected cluster number k;
Step 8: select k data objects from the categorical dataset D to form the current cluster center set;
Step 9: based on the current cluster center set, calculate the dissimilarity d(x_i, q_l) between each of the remaining n−k data objects x_i of the categorical dataset D and the k cluster centers q_l;
Step 10: according to the dissimilarity d(x_i, q_l) between data object x_i and cluster center q_l, assign the n−k data objects to their nearest clusters following the nearest-neighbor principle; after assignment, k clusters are obtained and the cluster labels of the n−k data objects are marked, giving the clustering result based on the current cluster center set;
Step 11: for each of the k clusters formed, select the feature value with the highest frequency of occurrence on each dimension within the cluster to form the new cluster center of that cluster, thereby obtaining a new cluster center set;
Step 12: repeat Steps 9-11 until the cluster centers no longer change or the specified maximum number of iterations is reached, then terminate the algorithm and output the clustering result based on the current cluster center set; otherwise, take the obtained new cluster center set as the current cluster center set and jump to Step 9 to continue iterating;
The iteration brings the selected cluster centers closer and closer to the real cluster centers, so the clustering effect improves as the iteration proceeds. The termination condition of the clustering algorithm can be chosen by the experimenter according to the actual situation: (1) terminate when the maximum number of iterations is reached; (2) terminate when the objective function reaches a threshold.
Where i, j = 1, 2, …, n and n is the number of data objects of the categorical dataset D; s = 1, 2, …, m and m is the number of features of a data object; l = 1, 2, …, k and k is the number of clusters; δ(A_{i,s}, A_{q_l,s}) is the dissimilarity between data object x_i and cluster center q_l on the s-th feature; A_{i,s} is the s-th feature value of data object x_i; A_{q_l,s} is the s-th feature value of cluster center q_l; f_{C_l}(A_{s,t}) is the number of data objects in cluster C_l whose feature value is A_{s,t}; |C_l| is the number of data objects in cluster C_l; ζ_l is an adjustment coefficient.
In Step 4, for a large-scale categorical dataset D whose data volume is 10 TB or more, the square wave kernel function method is used to calculate the local neighborhood density ρ_i of data object x_i; for a small-scale categorical dataset D whose data volume is less than 10 TB, the Gaussian kernel function method is used to calculate the local neighborhood density ρ_i of data object x_i. Note: large-scale data generally refers to data volumes above the 10 TB scale (1 TB = 1024 GB).
Compared with the prior art, the invention has the following characteristics:
1. The invention provides a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity, which can prevent the loss of important feature values during clustering, strengthen the similarity between feature values within a cluster, and weaken the similarity between feature values across clusters.
2. The automatic cluster center selection method provided by the invention greatly reduces the error caused by randomly or manually selecting cluster centers.
3. The dissimilarity coefficient calculation method provided by the invention preserves the characteristics of the data, achieves the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity, improves clustering accuracy, purity and recall, and effectively improves the clustering effect on categorical data.
Drawings
Fig. 1 is a schematic diagram of the sensitivity of the k-modes algorithm to initial cluster center selection: (a) clustering result for k=1, (b) clustering result for k=2, (c) clustering result for k=3.
Fig. 2 illustrates the case where the local neighborhood density of x_i is not the maximum density.
Fig. 3 illustrates the case where the local neighborhood density of x_i is the maximum density.
Fig. 4 illustrates the determination of the d_{c,i} value.
Fig. 5 is a schematic diagram of a two-dimensional dataset.
Fig. 6 is a decision graph.
Fig. 7 is the Z_i decision graph.
Fig. 8 is the IKMCA flowchart.
Detailed Description
The present invention will be further described in detail with reference to specific examples in order to make the objects, technical solutions and advantages of the present invention more apparent.
The relevant symbols and meanings used are shown in Table 1.
Table 1 Symbol descriptions
Taking data object x_i and cluster center q_l as an example, the simple Hamming distance of the classical k-modes algorithm is defined in equation (1); this calculation gives each feature the same weight:

d(x_i, q_l) = Σ_{s=1}^{m} δ(A_{i,s}, A_{q_l,s})    (1)

where δ(A_{i,s}, A_{q_l,s}) = 0 if A_{i,s} = A_{q_l,s}, and 1 otherwise.
the k-modes algorithm minimizes the objective function by a simple hamming distance. As shown in formula (2):
the dissimilarity coefficient of the classical k-modes algorithm does not take into account the relative frequency of occurrence of intra-cluster feature values, nor the intra-cluster inter-cluster structure of each feature, among the dissimilarity coefficients. Resulting in some clusters being assigned less similar data during the new data object partitioning process. For convenience of explanation, the artificial dataset D shown in Table 2 was used 1 The dissimilarity coefficient is demonstrated. D (D) 1 Description of a= { a by three features 1 ,A 2 ,A 3 }. Wherein DOM (A) 1 )={A,B},DOM(A 2 )={E,F},DOM(A 3 )={H,I}。D 1 There are two clusters C 1 And C 2 Respectively corresponding cluster center q 1 (A, E, H) and q 2 (A,E,H)。
Table 2 artificial dataset D 1
Suppose x_7 = (A, E, H) needs to be assigned to a cluster. Using the simple Hamming distance gives d(x_7, q_1) = d(x_7, q_2) = 0+0+0 = 0, so the two clusters cannot be distinguished. In terms of intra-cluster similarity, however, x_7 should be assigned to cluster C_1.
In initial cluster center selection, the classical k-modes algorithm is very sensitive to the initial cluster centers. The initial centers are chosen either by random initialization or by manual setting, and both methods make the clustering result unstable to some extent: initial cluster centers at different locations and different k values produce different clustering results. As shown in Fig. 1, the real number of clusters in the dataset is 3; choosing different initial cluster centers and setting different k values produces different clustering results. From left to right, Fig. 1 shows the randomly selected initial cluster centers, the iterative clustering process, and the final clustering result. It can be seen that finding appropriate initial cluster centers is important.
If the selected dissimilarity coefficient can find all or some of the potential modes in the dataset, the partition-based k-modes algorithm achieves twice the result with half the effort. For the k-modes algorithm to produce an effective clustering result, the dissimilarity between data objects within a cluster should be minimal and the dissimilarity between data objects in different clusters should be maximal. Therefore, the invention proposes a new dissimilarity coefficient, the "intra-cluster inter-cluster dissimilarity coefficient", based on intra-cluster and inter-cluster similarity. The k-modes algorithm based on intra-cluster and inter-cluster dissimilarity (IKMCA) uses a modified density peaks algorithm to determine the initial cluster centers, uses the intra-cluster inter-cluster dissimilarity coefficient to calculate the dissimilarity between each data object and the cluster centers, and updates the cluster centers.
The intra-cluster inter-cluster dissimilarity coefficient first considers the relative frequency with which feature values are distributed within the same cluster: data objects belonging to the same cluster share the same feature values more frequently and therefore have higher intra-cluster similarity. The intra-cluster dissimilarity is defined in equation (3):

d(x_i, q_l) = Σ_{s=1}^{m} ( 1 − f_{C_l}(A_{i,s}) / |C_l| )    (3)

where 1 ≤ i ≤ n, 1 ≤ s ≤ m, and f_{C_l}(A_{i,s}) denotes the number of data objects in cluster C_l whose s-th feature value equals A_{i,s}.
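A minimal Python sketch of equation (3) under the reconstruction above (the per-feature term 1 − f_{C_l}(A_{i,s})/|C_l|); the cluster is represented simply as a list of tuples, and the listed members of C_1 are hypothetical, since Table 2 is not reproduced in this text.

```python
def intra_cluster_dissimilarity(x, cluster):
    """Equation (3): sum over features of 1 - (frequency of x's value in the cluster) / |cluster|."""
    size = len(cluster)
    total = 0.0
    for s, value in enumerate(x):
        freq = sum(1 for obj in cluster if obj[s] == value)
        total += 1.0 - freq / size
    return total

# Reproducing the spirit of the D_1 example (cluster contents are assumed, not given in full):
C1 = [("A", "E", "H"), ("A", "F", "H"), ("B", "E", "H")]  # hypothetical members of C_1
x7 = ("A", "E", "H")
print(intra_cluster_dissimilarity(x7, C1))  # (1-2/3) + (1-2/3) + (1-1) = 2/3
```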
Using dataset D_1 and equation (3), d(x_7, q_1) = (1−2/3)+(1−2/3)+(1−1) = 2/3 and d(x_7, q_2) = (1−2/3)+(1−2/3)+(1−2/3) = 1. From these results, x_7 has the smallest dissimilarity to cluster C_1 and should therefore be assigned to C_1. Although equation (3) considers the relative frequency of feature values within a cluster, it does not consider how the feature values are distributed across clusters. The artificial dataset D_2 shown in Table 3 is used to discuss this defect of ignoring inter-cluster similarity. D_2 is described by three categorical features A = {A_1, A_2, A_3}, where DOM(A_1) = {A, B, C}, DOM(A_2) = {E, F}, DOM(A_3) = {H, I, J}. D_2 has three clusters C_1, C_2 and C_3, with cluster centers q_1 = (A, E, H), q_2 = (A, E, H) and q_3 = (B, E, I), respectively.
Table 3 Artificial dataset D_2
Suppose x_10 = (A, E, H) needs to be assigned to a cluster. Using the simple Hamming distance, d(x_10, q_1) = d(x_10, q_2) = d(x_10, q_3) = 0+0+0 = 0. Using equation (3), d(x_10, q_1) = (1−2/3)+(1−2/3)+(1−3/3) = 2/3, d(x_10, q_2) = (1−2/3)+(1−3/3)+(1−2/3) = 2/3, and d(x_10, q_3) = 1+0+1 = 2. From these results, the simple Hamming distance cannot assign x_10 to a cluster at all, and equation (3) can assign x_10 to either cluster C_1 or cluster C_2, i.e. equation (3) cannot accurately determine the cluster of x_10. Viewed from the perspective of "low intra-cluster dissimilarity, high inter-cluster dissimilarity", however, dataset D_2 shows that x_10 is better assigned to cluster C_1, because assigning x_10 to C_1 maximizes the dissimilarity between cluster C_1 and cluster C_2.
The inter-cluster dissimilarity further considers the frequency of a feature value within a cluster relative to its total frequency over all clusters: if a feature value is frequently distributed only within one cluster, the difference between that cluster and the other clusters with respect to this feature value is large. The intra-cluster inter-cluster dissimilarity coefficient is defined in equation (4):

d(x_i, q_l) = Σ_{s=1}^{m} ( 1 − (f_{C_l}(A_{i,s}) / |C_l|) · (f_{C_l}(A_{i,s}) / f_D(A_{i,s})) )    (4)

where 1 ≤ i ≤ n, 1 ≤ s ≤ m, and f_D(A_{i,s}) denotes the number of data objects in the whole dataset D whose s-th feature value equals A_{i,s}.
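A sketch of equation (4) under the same reconstruction: the per-feature term multiplies the within-cluster relative frequency by the ratio of the within-cluster count to the count over the whole dataset. Function names are illustrative; with the (not reproduced) contents of Table 3 this would yield the values 1.9, 2.025 and 2.625 computed in the text below.

```python
def intra_inter_cluster_dissimilarity(x, cluster, dataset):
    """Equation (4): sum over features of 1 - (f_Cl/|Cl|) * (f_Cl/f_D), where f_Cl is the
    count of x's feature value inside the cluster and f_D its count in the whole dataset."""
    size = len(cluster)
    total = 0.0
    for s, value in enumerate(x):
        f_cl = sum(1 for obj in cluster if obj[s] == value)
        f_d = sum(1 for obj in dataset if obj[s] == value)
        term = 0.0 if f_d == 0 else (f_cl / size) * (f_cl / f_d)
        total += 1.0 - term
    return total
```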
Using equation (4) on dataset D_2: d(x_10, q_1) = (1−2/3×2/4)+(1−2/3×2/8)+(1−3/3×3/5) = 1.9; d(x_10, q_2) = (1−2/3×2/4)+(1−3/3×3/8)+(1−2/3×2/5) = 2.025; d(x_10, q_3) = (1−0×1)+(1−3/3×3/8)+(1−0×1) = 2.625. As these results show, x_10 has the smallest dissimilarity to cluster C_1, which is consistent with the previous analysis, so x_10 is assigned correctly. Equation (4) is verified further on the more specific artificial dataset D_3 shown in Table 4. D_3 is described by three features A = {A_1, A_2, A_3}, where DOM(A_1) = {A, B}, DOM(A_2) = {E, F}, DOM(A_3) = {H, I}. D_3 has three clusters C_1, C_2 and C_3 with cluster centers q_1 = (A, E, H), q_2 = (A, E, H) and q_3 = (A, E, H), respectively; the values A, E and H are uniformly distributed in D_3, each appearing 6 times.
Table 4 Artificial dataset D_3
The simple Hamming distance, equation (3) and equation (4) are each used to assign x_10 = (A, E, H) to a cluster. The simple Hamming distance gives d(x_10, q_1) = d(x_10, q_2) = d(x_10, q_3) = 0+0+0 = 0; equation (3) gives d(x_10, q_1) = d(x_10, q_2) = d(x_10, q_3) = (1−2/3)+(1−2/3)+(1−2/3) = 1; equation (4) gives d(x_10, q_1) = d(x_10, q_2) = d(x_10, q_3) = (1−2/3×2/6)+(1−2/3×2/6)+(1−2/3×2/6) = 21/9. As these results show, when the feature values are uniformly distributed none of the three dissimilarity coefficients can correctly assign x_10 to a cluster, so the intra-cluster inter-cluster dissimilarity coefficient is refined further.
To refine the coefficient, the feature value distribution of data object x_i is compared with the overall feature value distribution of the cluster; this refinement term is defined in equation (5), where x_i is the data object to be assigned and x_j is a data object within cluster C_l. The redefined intra-cluster inter-cluster dissimilarity coefficient, shown in equation (6), adds the adjustment coefficient ζ_l obtained from equation (5) to the dissimilarity of equation (4).
For any x_i, x_j ∈ D, the dissimilarity d has the following properties:
Distance to self: for every x_i, the distance of an object to itself is zero, d(x_i, x_i) = 0.
Symmetry: for all x_i and x_j, the distance from x_i to x_j equals the distance from x_j to x_i, d(x_i, x_j) = d(x_j, x_i).
Non-negativity: for all x_i, x_j, the distance d is non-negative, and d(x_i, x_j) = 0 if and only if x_i = x_j.
Triangle inequality: for all x_i, x_j and x_h, d(x_i, x_j) ≤ d(x_i, x_h) + d(x_h, x_j).
Recomputing dataset D_3 with equation (6) yields ζ_1 = 0.3333, ζ_2 = 0.4815 and ζ_3 = 0.5555, so the new dissimilarity coefficients are d(x_10, q_1) = 21/9 + ζ_1 = 2.6666, d(x_10, q_2) = 21/9 + ζ_2 = 2.8148 and d(x_10, q_3) = 21/9 + ζ_3 = 2.8888. From these results, x_10 should be assigned to cluster C_1. This assignment matches the actual situation, which shows that the proposed dissimilarity coefficient scheme is feasible.
In 2014, Rodriguez et al. proposed the density peaks (DP) algorithm. The DP algorithm is a clustering algorithm based on relative distance and local neighborhood density and was designed for numerical data; however, its input is a dissimilarity matrix between data objects, so by computing dissimilarities with a suitable dissimilarity coefficient the DP algorithm can be applied to clustering other types of data. This section uses the DP algorithm's ability to automatically determine the number of clusters to determine the initial cluster centers.
The local neighborhood density ρ_i of data object x_i is the number of data objects within the circular region of radius d_c (the cut-off distance) centered at x_i. The local neighborhood density of data object x_i can be defined by either the square wave kernel function method or the Gaussian kernel function method:
the square wave kernel function method is suitable for large-scale data sets, and the square wave kernel function method is used for solving rho i Is defined as shown in formula (7):
wherein: d, d i,j -d c When less than or equal to 0, χ (x) =1; otherwise, χ (x) =0.
If the number of objects in the dataset is small, computing ρ_i with the square wave kernel function is affected by statistical errors and the clustering result becomes inaccurate; in that case the Gaussian kernel function method can be used. The Gaussian kernel function is a common density estimation method and is widely used in density-based clustering analysis. Its definition of the local neighborhood density ρ_i is shown in equation (8):

ρ_i = Σ_{j≠i} K(d_{i,j} / d_c)    (8)

where K(x) = exp{−x²}.
From equations (7) and (8), the value of d_c directly affects the value of ρ_i and thus the selection of cluster centers and the overall clustering result. Determining an appropriate d_c value is therefore important to the algorithm.
The relative distance L_i between data object x_i and the other data objects is defined as in the DP algorithm, shown in equation (9), which combines the two cases of equations (10) and (11).
When ρ_i is not the maximum density, L_i is defined as the distance between x_i and the data object closest to x_i among all data objects whose local neighborhood density is larger than that of x_i, as shown in equation (10) and Fig. 2:

L_i = min_{j: ρ_j > ρ_i} d_{i,j}    (10)

When ρ_i is the maximum density, L_i is defined as the distance between x_i and the data object farthest from x_i, as shown in equation (11) and Fig. 3:

L_i = max_{j} d_{i,j}    (11)

A data object with both a high L_i and a high ρ_i is a cluster center.
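A sketch of the relative distance L_i of equations (10) and (11): for the highest-density object L_i is its distance to the farthest object, otherwise the distance to its nearest higher-density neighbour.

```python
import numpy as np

def relative_distance(dist, rho):
    """Relative distance L_i (equations (10)-(11)) for every object."""
    n = dist.shape[0]
    L = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]       # objects denser than x_i
        if higher.size == 0:                     # x_i has the maximum density
            L[i] = dist[i].max()                 # equation (11)
        else:
            L[i] = dist[i, higher].min()         # equation (10)
    return L
```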
The cut-off distance d_c is a threshold that defines the distance search range. In the DP algorithm, the d_c value must be determined manually: the pairwise distances between data objects are sorted in ascending order, and the values at roughly the first 1% to 2% of positions give an approximate range for d_c. In practical clustering problems, setting d_c too large causes the calculated ρ_i values to overlap, while setting d_c too small makes the cluster distribution sparse. The invention gives an explicit method for determining the d_c value. Let d_{i,j} = [d_{i,1}, d_{i,2}, ..., d_{i,n}] be the dissimilarities between data object x_i and the other data objects, computed with equation (1). Sorting d_{i,j} in ascending order gives d'_{i,j} = [d'_{i,1}, d'_{i,2}, ..., d'_{i,n}]. The cut-off distance d_{c,i} of x_i is defined as shown in equation (12):
where max(d'_{i,j+1} − d'_{i,j}) is the maximum difference between adjacent dissimilarities in d'_{i,j}.
Let d'_{i,j} = d_a and d'_{i,j+1} = d_b. As shown in Fig. 4, data object x_i is less dissimilar to data objects in the same cluster and more dissimilar to data objects in different clusters. Therefore, in d'_{i,j} = [d'_{i,1}, d'_{i,2}, ..., d'_{i,j}, d'_{i,j+1}, ..., d'_{i,n}] there must be a critical position at which the difference d'_{i,j+1} − d'_{i,j} is largest, so that data object a can be considered to belong to the same cluster as x_i while data object b belongs to a different cluster. The d_{c,i} value of data object x_i is computed as shown in equation (13):
d c the value is defined as set d c,i The minimum value of (2) is shown in equation (14).
d c =min(d c,i ) (14)
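Equations (12)-(13) are not reproduced in this text, so the sketch below follows the wording of Step 2 and equation (14): for each object, sort its dissimilarities in ascending order, take the largest gap between adjacent values as d_{c,i}, and take d_c as the minimum over all d_{c,i}. Whether d_{c,i} is the gap itself or a value inside that gap is our reading of the text, so treat this detail as an assumption.

```python
import numpy as np

def cutoff_distance(dist):
    """d_c per equation (14): the minimum over objects of d_{c,i}, where d_{c,i} is taken
    here as the largest gap between adjacent sorted dissimilarities of object i
    (our reading of Step 2 / equations (12)-(13))."""
    n = dist.shape[0]
    d_ci = np.zeros(n)
    for i in range(n):
        sorted_d = np.sort(np.delete(dist[i], i))   # ascending d'_{i,j}
        gaps = np.diff(sorted_d)                    # d'_{i,j+1} - d'_{i,j}
        d_ci[i] = gaps.max() if gaps.size else 0.0
    return d_ci.min()
```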
IKMCA determines the initial cluster centers based on two assumptions: (1) the local neighborhood density of a cluster center is higher than that of the surrounding non-center points; (2) the relative distance between cluster centers is large. Based on these assumptions, this section presents a method for automatically determining the initial cluster centers. Fig. 5 shows a two-dimensional example dataset with 93 data objects and 2 clusters, corresponding to 2 cluster centers.
In the DP algorithm, the cluster centers are selected from a decision graph. In the decision graph shown in Fig. 6, the horizontal axis is the local neighborhood density ρ_i of data object x_i and the vertical axis is the relative distance L_i. Data objects whose ρ_i and L_i values are both large are the cluster centers of the dataset: the two points in the upper right corner of Fig. 6 are the cluster centers corresponding to the two clusters of the dataset. The remaining data objects, which surround the cluster centers, do not have both a large local neighborhood density ρ_i and a large relative distance L_i at the same time.
To observe and determine the cluster centers more intuitively, the Z_i decision graph is used to select cluster centers. The local neighborhood density and relative distance of each data object are obtained from equations (8) and (9). Z_i = ρ_i × L_i is computed for all data objects, and the Z_i values are sorted in descending order to obtain the sequence Z_(1) > Z_(2) > ... > Z_(n), where the points corresponding to Z_(1) > Z_(2) > ... > Z_(k) (k < n) are the cluster centers. As shown in Fig. 7, the horizontal axis of the Z_i decision graph is the subscript of data object x_i and the vertical axis is the Z_i value; the larger the Z_i value, the more likely the object is a cluster center. At the transition from cluster centers to non-cluster centers, the Z_i curve has a very clear inflection point: the abscissa at the inflection point is the cluster number k, the data objects to the left of the inflection point are cluster center points, and the data objects to the right are non-center points. In Fig. 7, the two points at the upper left are cluster center points and the points at the lower right are all non-center points.
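A sketch of the Z_i decision values together with a simple automatic stand-in for the manually observed inflection point (the position of the largest drop between consecutive sorted Z values); the patent itself reads the inflection point off the plot, so the "largest drop" rule is an assumption for illustration only.

```python
import numpy as np

def decision_values(rho, L):
    """Z_i = rho_i * L_i for every object."""
    return rho * L

def select_centers(Z):
    """Sort Z descending and pick k at the largest drop between consecutive values,
    a simple proxy for reading the inflection point off the Z_i decision graph."""
    order = np.argsort(-Z)           # indices sorted by decreasing Z
    Z_sorted = Z[order]
    drops = Z_sorted[:-1] - Z_sorted[1:]
    k = int(np.argmax(drops)) + 1    # number of points before the largest drop
    return k, order[:k]              # k and the indices of the cluster centers
```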
The new intra-cluster inter-cluster dissimilarity coefficient makes the dissimilarity calculation for categorical data more accurate, and the autonomous selection of the initial cluster centers avoids the uncertainty in the clustering result caused by the random selection or manual setting of the classical k-modes algorithm. The DP algorithm gives only an approximate range for d_c; on the basis of the DP algorithm, the invention gives an explicit method for determining d_c. The intra-cluster inter-cluster dissimilarity coefficient is applied to the classical k-modes algorithm, whose objective function is defined in equation (15), where d(x_i, q_l) is the dissimilarity of equation (6). Theorem 1 shows how to minimize the objective function F(U, Q).

F(U, Q) = Σ_{l=1}^{k} Σ_{i=1}^{n} u_{il} d(x_i, q_l)    (15)

u_{il} ∈ {0, 1}, 1 ≤ i ≤ n, 1 ≤ l ≤ k    (16)

Σ_{l=1}^{k} u_{il} = 1, 1 ≤ i ≤ n    (17)

0 < Σ_{i=1}^{n} u_{il} < n, 1 ≤ l ≤ k    (18)

U_{n×k} is a membership matrix satisfying constraints (16)-(18); u_{il} = 1 means that x_i belongs to cluster C_l. When constraints (16)-(18) are satisfied and the objective function F(U, Q) reaches its minimum value, the clustering algorithm can be judged to have finished.
Theorem 1: cluster center selection of IKMCA should be such that the function F (U, Q) is minimized if and only ift s ≠t h (s is more than or equal to 1 and less than or equal to m). The text describes that each characteristic value of the cluster center should select the characteristic value with the largest frequency of occurrence on each characteristic in the data set.
f_{s,t} (1 ≤ s ≤ m, 1 ≤ t ≤ n_s) denotes the number of data objects whose s-th feature takes the value A_{s,t}, as shown in equation (19):

f_{s,t} = |{x_i ∈ D : x_{i,s} = A_{s,t}}|    (19)

If and only if q_{l,s} ∈ DOM(A_s) for 1 ≤ s ≤ m and equation (20) is satisfied, the function F(U, Q) is minimized.
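Per Theorem 1 (and Step 11), each feature of a cluster's new center is the most frequent value of that feature within the cluster; a minimal sketch:

```python
from collections import Counter

def update_center(cluster):
    """Theorem 1 / Step 11: the new cluster center takes, for every feature,
    the value that occurs most frequently within the cluster."""
    m = len(cluster[0])
    return tuple(Counter(obj[s] for obj in cluster).most_common(1)[0][0] for s in range(m))

# e.g. update_center([("A","E","H"), ("A","F","H"), ("B","E","H")]) -> ("A", "E", "H")
```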
in order to make the objective function F (U, Q) reach a minimum value, an improved k-mode algorithm (IKMCA) based on intra-cluster dissimilarity, that is, a classification type data clustering method based on intra-cluster dissimilarity, as shown in fig. 8, is described as follows:
Input: a categorical dataset D with n data objects and m categorical features. Output: the clusters C = {C_1, C_2, ..., C_k} obtained when clustering is complete.
Step 1. Calculate the dissimilarities d_{i,j} by equation (1) and obtain the dissimilarity matrix d_{n×n};
Step 2. Calculate the cut-off distance d_c according to equation (14);
Step 3. Calculate the local neighborhood densities ρ_i using equation (7) or equation (8);
Step 4. Calculate the relative distances L_i using equation (9);
Step 5. Calculate Z_i = ρ_i × L_i for all data objects to obtain Z = {Z_1, Z_2, ..., Z_n};
Step 6. Sort the Z_i values in descending order to obtain the sequence Z_(1) > Z_(2) > ... > Z_(n); plot Z_i against the subscript of data object x_i and determine the inflection point of the plot; the abscissa at the inflection point is the optimal value of k;
Step 7. Determine the value of k and the initial cluster center set Q^(0) = {q_1, q_2, ..., q_k};
Step 8. Calculate the dissimilarity d(x_i, q_l) between each of the n−k remaining data objects in the dataset and the k initial cluster centers according to equation (6);
Step 9. Assign each data object to the nearest initial cluster according to the nearest-neighbor principle; after assignment, obtain the k clusters C^(1) = {C_1, C_2, ..., C_k} and mark the cluster labels of the n−k data objects;
Step 10. Update the cluster centers Q^(1) = {q_1, q_2, ..., q_k} on the newly formed clusters according to Theorem 1;
Step 11. Repeat Steps 8-10 until the objective function value no longer changes; if it no longer changes, end the algorithm, otherwise return to Step 8 and continue;
Step 12. The algorithm ends and the clustering is complete.
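Putting the pieces together, the sketch below strings the helper functions from the earlier sketches into the Step 1-Step 12 loop above (initial centers from the Z_i decision values, assignment by the intra-cluster inter-cluster dissimilarity, mode-style center update). Data objects are tuples of categorical values. The ζ_l adjustment of equations (5)-(6) is omitted because those formulas are not reproduced in this text, so equation (4) is used for assignment here.

```python
def ikmca(data, max_iter=100):
    """Sketch of the IKMCA loop (Steps 1-12 above), using the helper sketches from the
    earlier examples; assignment uses equation (4) and omits the zeta_l adjustment."""
    dist = dissimilarity_matrix(data)                         # Step 1
    d_c = cutoff_distance(dist)                               # Step 2
    rho = local_density(dist, d_c)                            # Step 3
    L = relative_distance(dist, rho)                          # Step 4
    k, center_idx = select_centers(decision_values(rho, L))   # Steps 5-6
    centers = [data[i] for i in center_idx]                   # Step 7
    clusters = [[data[i]] for i in center_idx]                # seed each cluster with its center

    labels = None
    for _ in range(max_iter):                                 # Steps 8-11
        # Steps 8-9: assign every object to the cluster with minimal dissimilarity,
        # measured against the clusters of the previous iteration.
        new_labels = [min(range(k),
                          key=lambda l: intra_inter_cluster_dissimilarity(x, clusters[l], data))
                      for x in data]
        new_clusters = [[x for x, lab in zip(data, new_labels) if lab == l] for l in range(k)]
        new_clusters = [c if c else clusters[l] for l, c in enumerate(new_clusters)]  # keep clusters non-empty
        new_centers = [update_center(c) for c in new_clusters]                        # Step 10 (Theorem 1)
        if new_centers == centers:                            # Steps 11-12: convergence
            clusters, labels = new_clusters, new_labels
            break
        centers, clusters, labels = new_centers, new_clusters, new_labels
    return labels, centers
```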
Let l be the number of iterations required for the algorithm to converge; typically n >> m, k, l. The time complexity of the IKMCA algorithm is dominated by updating the cluster centers and the dissimilarities in each iteration. Because the decision graph must be observed manually to initialize the cluster centers, the time complexity of that stage is not counted in the overall algorithm. Using the intra-cluster inter-cluster dissimilarity, the time complexity of updating cluster centers and computing dissimilarities in each iteration is l(O(nmk) + O(nmk)) = O(nmkl), so the total time complexity of IKMCA is O(nmkl). From this analysis, the time complexity of IKMCA is linear in the number of data objects, the number of features and the number of clusters, i.e. the algorithm is linearly scalable.
To evaluate the effectiveness of the proposed algorithm, the clustering results are evaluated with three indexes, clustering accuracy AC, purity PR and recall RE, as shown in equations (21) to (23). NUM+ denotes the number of data objects correctly partitioned into cluster C_l; NUM− denotes the number of data objects incorrectly partitioned into cluster C_l; NUM denotes the number of data objects that should be partitioned into cluster C_l but are not. The closer the clustering result is to the true partition of the dataset, the larger the values of AC, PR and RE, and the more effective the algorithm.
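Equations (21)-(23) are not reproduced above, so the sketch below uses the usual per-cluster precision/recall reading of NUM+, NUM− and the missed count, together with a simple accuracy; treat the exact formulas as assumptions.

```python
from collections import Counter

def evaluate(labels, truth):
    """Rough AC / PR / RE in the spirit of equations (21)-(23): each predicted cluster is
    matched to its majority true class, then NUM+ = correctly placed objects,
    NUM- = wrongly placed objects, and the missed count = objects of that class placed elsewhere."""
    n = len(labels)
    correct = 0
    pr, re = [], []
    for l in set(labels):
        members = [t for p, t in zip(labels, truth) if p == l]
        majority, num_plus = Counter(members).most_common(1)[0]
        num_minus = len(members) - num_plus
        num_missed = sum(1 for t in truth if t == majority) - num_plus
        correct += num_plus
        pr.append(num_plus / (num_plus + num_minus))
        re.append(num_plus / (num_plus + num_missed))
    return correct / n, sum(pr) / len(pr), sum(re) / len(re)
```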
The algorithm was implemented in Python, and all experiments were run on an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz under the Windows 10 operating system. The datasets used are real datasets from the UCI machine learning repository provided by the University of California, Irvine. To test the effectiveness of the algorithm, the Mushroom (Mus), Breast-cancer (Bre), Car and Soybean-small (Soy) datasets were selected from the UCI repository for experimental verification. Table 5 lists the details of these datasets.
Table 5 dataset description
The IKMCA algorithm provided by the invention, the k-modes algorithm proposed by Huang, the IDMKCA algorithm proposed by Ng et al. and the EKACMD algorithm proposed by Ravi et al. were each run 30 times and the results averaged. The AC, PR and RE results are shown in Tables 6 to 9.
Table 6 experimental results of four algorithms under the Mus dataset
Table 7 experimental results of four algorithms under Bre dataset
Table 8 experimental results of four algorithms under Car dataset
Table 9 experimental results of four algorithms under the Soy dataset
From the above experimental results, IKMCA outperforms the k-modes, IDMKCA and EKACMD algorithms in AC, PR and RE in most cases on the Mus, Bre, Car and Soy datasets. The reason IKMCA outperforms the classical k-modes algorithm is that the preprocessing of the k-modes algorithm destroys the original structure of the categorical features: computing the dissimilarity coefficient with a simple Hamming distance on the converted categorical feature values cannot reveal the dissimilarity between categorical data. When a dataset has very many features, a simple 0-1 comparison may produce a very large dissimilarity, a very small dissimilarity, or even no dissimilarity at all. Compared with the classical k-modes algorithm and the IDMKCA and EKACMD algorithms, the proposed intra-cluster inter-cluster dissimilarity can better reveal the structure of the dataset.
The classical k-modes algorithm uses the simple Hamming distance for dissimilarity computation, which weakens intra-cluster similarity and ignores inter-cluster similarity. To address these problems, the invention proposes a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity. The method prevents the loss of important feature values during clustering, strengthens the similarity between feature values within a cluster, and weakens the similarity between feature values across clusters. The automatic cluster center selection method greatly reduces the error caused by randomly or manually selecting cluster centers. The invention uses several illustrative examples to discuss the limitations of the simple Hamming distance and several other dissimilarity coefficients in the k-modes algorithm, and proposes a new dissimilarity coefficient. The categorical data clustering algorithm based on the improved dissimilarity coefficient and the k-modes algorithms based on other dissimilarity coefficients were tested on UCI datasets. The experimental results show that the proposed dissimilarity coefficient calculation method preserves the characteristics of the data, achieves the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity, improves clustering accuracy, purity and recall, and effectively improves the clustering effect on categorical data.
The proposed algorithm (IKMCA) can be applied to a rehabilitation therapy scheme recommendation system containing massive purely categorical data; in such a system, categorical data clustering can help analysts distinguish different patient groups in a patient database, discover deep information hidden in the database, and formulate personalized rehabilitation schemes.
It should be noted that, although the examples described above are illustrative, this is not a limitation of the present invention, and thus the present invention is not limited to the above-described specific embodiments. Other embodiments, which are apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein, are considered to be within the scope of the invention as claimed.
Claims (2)
1. A categorical data clustering method based on intra-cluster and inter-cluster dissimilarity, applied to a rehabilitation therapy scheme recommendation system containing massive purely categorical data, in which categorical data clustering helps analysts distinguish different patient groups in a patient database and formulate personalized rehabilitation schemes; the method is characterized by comprising the following steps:
Step 1: for a categorical dataset D with n data objects, calculate the dissimilarity d_{i,j} between every two data objects using the simple Hamming distance;
Step 2: for each data object x_i of the categorical dataset D, first sort the dissimilarities d_{i,j} between x_i and the other data objects in ascending order to obtain the dissimilarity vector d'_{i,j} = [d'_{i,1}, d'_{i,2}, ..., d'_{i,n}] of x_i; then take the maximum difference between two adjacent dissimilarities in the dissimilarity vector d'_{i,j} as the cut-off distance d_{c,i} of data object x_i;
Step 3: select the minimum of the cut-off distances d_{c,i} of all data objects in the categorical dataset D as the cut-off distance d_c of the categorical dataset D;
Step 4: based on the cut-off distance d_c of the categorical dataset D, calculate the local neighborhood density ρ_i of each data object x_i of the categorical dataset D using the square wave kernel function method or the Gaussian kernel function method;
Step 5: calculate the relative distance L_i of each data object x_i of the categorical dataset D;
Step 6: for each data object x_i of the categorical dataset D, use its local neighborhood density ρ_i and relative distance L_i to obtain its decision value Z_i:
Z_i = ρ_i × L_i
Step 7: first sort the decision values Z_i of all data objects in the categorical dataset D in descending order to obtain a sorted sequence; based on the sorted sequence, draw the decision graph of the categorical dataset D with the subscript i of data object x_i as the abscissa and the decision value Z_i of data object x_i as the ordinate; the abscissa at the inflection point of this decision graph is the selected cluster number k;
Step 8: select k data objects from the categorical dataset D to form the current cluster center set;
Step 9: based on the current cluster center set, calculate the dissimilarity d(x_i, q_l) between each of the remaining n−k data objects x_i of the categorical dataset D and the k cluster centers q_l;
Step 10: according to the dissimilarity d(x_i, q_l) between data object x_i and cluster center q_l, assign the n−k data objects to their nearest clusters following the nearest-neighbor principle; after assignment, k clusters are obtained and the cluster labels of the n−k data objects are marked, giving the clustering result based on the current cluster center set;
Step 11: for each of the k clusters formed, select the feature value with the highest frequency of occurrence on each dimension within the cluster to form the new cluster center of that cluster, thereby obtaining a new cluster center set;
Step 12: repeat Steps 9-11 until the cluster centers no longer change or the specified maximum number of iterations is reached, then terminate the algorithm and output the clustering result based on the current cluster center set; otherwise, take the obtained new cluster center set as the current cluster center set and jump to Step 9 to continue iterating;
where i, j = 1, 2, …, n and n is the number of data objects of the categorical dataset D; s = 1, 2, …, m and m is the number of features of a data object; l = 1, 2, …, k and k is the number of clusters; δ(A_{i,s}, A_{q_l,s}) is the dissimilarity between data object x_i and cluster center q_l on the s-th feature; A_{i,s} is the s-th feature value of data object x_i; A_{q_l,s} is the s-th feature value of cluster center q_l; f_{C_l}(A_{s,t}) is the number of data objects in cluster C_l whose feature value is A_{s,t}; |C_l| is the number of data objects in cluster C_l; ζ_l is an adjustment coefficient.
2. The categorical data clustering method based on intra-cluster and inter-cluster dissimilarity according to claim 1, characterized in that in Step 4, for a large-scale categorical dataset D whose data volume is 10 TB or more, the square wave kernel function method is used to calculate the local neighborhood density ρ_i of data object x_i; for a small-scale categorical dataset D whose data volume is less than 10 TB, the Gaussian kernel function method is used to calculate the local neighborhood density ρ_i of data object x_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011009696.6A CN112132217B (en) | 2020-09-23 | 2020-09-23 | Classification type data clustering method based on inter-cluster dissimilarity in clusters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011009696.6A CN112132217B (en) | 2020-09-23 | 2020-09-23 | Classification type data clustering method based on inter-cluster dissimilarity in clusters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112132217A CN112132217A (en) | 2020-12-25 |
CN112132217B true CN112132217B (en) | 2023-08-15 |
Family
ID=73841250
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011009696.6A Active CN112132217B (en) | 2020-09-23 | 2020-09-23 | Classification type data clustering method based on inter-cluster dissimilarity in clusters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112132217B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114970649B (en) * | 2021-02-23 | 2024-07-26 | 广东精点数据科技股份有限公司 | Network information processing method based on clustering algorithm |
CN117853152B (en) * | 2024-03-07 | 2024-05-17 | 云南疆恒科技有限公司 | Business marketing data processing system based on multiple channels |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170371886A1 (en) * | 2016-06-22 | 2017-12-28 | Agency For Science, Technology And Research | Methods for identifying clusters in a dataset, methods of analyzing cytometry data with the aid of a computer and methods of detecting cell sub-populations in a plurality of cells |
- 2020-09-23: CN application CN202011009696.6A filed; granted as patent CN112132217B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6049797A (en) * | 1998-04-07 | 2000-04-11 | Lucent Technologies, Inc. | Method, apparatus and programmed medium for clustering databases with categorical attributes |
CN107122793A (en) * | 2017-03-23 | 2017-09-01 | 北京航空航天大学 | A kind of improved global optimization k modes clustering methods |
CN107358368A (en) * | 2017-07-21 | 2017-11-17 | 国网四川省电力公司眉山供电公司 | A kind of robust k means clustering methods towards power consumer subdivision |
CN108510010A (en) * | 2018-04-17 | 2018-09-07 | 中国矿业大学 | A kind of density peaks clustering method and system based on prescreening |
Non-Patent Citations (1)
Title |
---|
A k-prototypes clustering algorithm for mixed-type data clustering; Jia Ziqi; Song Ling; Journal of Chinese Computer Systems (No. 09); 1845-1852 *
Also Published As
Publication number | Publication date |
---|---|
CN112132217A (en) | 2020-12-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
De Amorim | Feature relevance in ward’s hierarchical clustering using the L p norm | |
Sra | Directional statistics in machine learning: a brief review | |
Wang et al. | Locality sensitive outlier detection: A ranking driven approach | |
Panday et al. | Feature weighting as a tool for unsupervised feature selection | |
CN112132217B (en) | Classification type data clustering method based on inter-cluster dissimilarity in clusters | |
CN113807456B (en) | Feature screening and association rule multi-label classification method based on mutual information | |
Xu et al. | A feasible density peaks clustering algorithm with a merging strategy | |
CN113344019A (en) | K-means algorithm for improving decision value selection initial clustering center | |
Chen et al. | Central clustering of categorical data with automated feature weighting | |
Zhang et al. | Attribute-augmented semantic hierarchy: Towards a unified framework for content-based image retrieval | |
Jeong et al. | Data depth based clustering analysis | |
Wang et al. | Information theoretical clustering via semidefinite programming | |
JP2013073256A (en) | Approximate nearest neighbor search method, nearest neighbor search program, and nearest neighbor search device | |
Cheng et al. | A local cores-based hierarchical clustering algorithm for data sets with complex structures | |
Li et al. | A novel rough fuzzy clustering algorithm with a new similarity measurement | |
CN107704872A (en) | A kind of K means based on relatively most discrete dimension segmentation cluster initial center choosing method | |
Ding et al. | Density peaks clustering algorithm based on improved similarity and allocation strategy | |
CN108710690A (en) | Medical image search method based on geometric verification | |
Jiang et al. | An unsupervised feature selection framework based on clustering | |
Tao et al. | An optimal density peak algorithm based on data field and information entropy | |
Altintakan et al. | An improved BOW approach using fuzzy feature encoding and visual-word weighting | |
Sun et al. | On the Adaptive Selection of the Cutoff Distance and Cluster Centers Based on CFSFDP | |
Maji et al. | Rough hypercuboid based supervised regularized canonical correlation for multimodal data analysis | |
Cao et al. | Incomplete data fuzzy C-means method based on spatial distance of sample | |
Sun et al. | Research on the optimized clustering method based on CFSFDP |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |