CN112132217B - A Categorical Data Clustering Method Based on Intra-cluster Dissimilarity - Google Patents
- Publication number
- CN112132217B (granted from application CN202011009696.6A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- data
- dissimilarity
- data object
- clusters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/2321—Pattern recognition; Analysing; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/22—Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- Y02D10/00—Climate change mitigation technologies in ICT; Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The invention relates to the technical field of data clustering, and in particular to a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity.
Background
A clustering algorithm is a machine-learning algorithm that groups data. Given a data set, a clustering algorithm divides it into groups such that, in principle, objects in the same group share similar attributes or features while objects in different groups differ substantially. Clustering is an unsupervised learning technique and, as a widely used data-analysis tool, has been applied in many fields.
In data science, clustering algorithms are used for cluster analysis; grouping the data makes the information it contains easier to extract. As an important part of data mining, categorical data clustering can help analysts summarize the disease characteristics of each class of patients from the database of a rehabilitation treatment recommendation system, so that they can focus on a specific patient group for further analysis.
The classic k-means algorithm uses the Euclidean distance to compute cluster means and the dissimilarity between data objects, so it applies only to numerical data sets with continuous features; for categorical data sets with discrete features, k-means is no longer applicable. In 1998, Huang extended k-means by replacing "means" with "modes" and proposed the k-modes algorithm for categorical data clustering. k-modes computes dissimilarity with the simple Hamming distance, which ignores differences within the same categorical feature across data objects, weakens intra-cluster similarity, and does not fully reflect the dissimilarity between two values of the same categorical feature, degrading the accuracy of the clustering results. In addition, k-modes determines the initial cluster centers and the value of k by random selection and recomputes and updates cluster centers with a frequency-based method, which introduces considerable uncertainty into the clustering results.
Summary of the Invention
The present invention addresses the accuracy of the k-modes algorithm and the problem of initial cluster center selection by providing a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity.
To solve the above problems, the present invention is realized through the following technical scheme:
A categorical data clustering method based on intra-cluster and inter-cluster dissimilarity comprises the following steps:
Step 1. For a categorical data set D with n data objects, compute the dissimilarity $d_{i,j}$ between every pair of data objects using the simple Hamming distance.
Step 2. For each data object $x_i$ of D, sort the dissimilarities $d_{i,j}$ between $x_i$ and the other data objects in ascending order to obtain the dissimilarity vector $d'_i = [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,n}]$; then take the maximum difference between two adjacent dissimilarities in $d'_i$ as the cut-off distance $d_{c,i}$ of $x_i$.
Step 3. Take the minimum of the cut-off distances $d_{c,i}$ over all data objects in D as the cut-off distance $d_c$ of D.
Step 4. Based on the cut-off distance $d_c$ of D, compute the local neighborhood density $\rho_i$ of each data object $x_i$ using either the square-wave kernel method or the Gaussian kernel method.
Step 5. Compute the relative distance $L_i$ of each data object $x_i$ in D.
Step 6. For each data object $x_i$ of D, obtain its decision value $Z_i$ from its local neighborhood density $\rho_i$ and relative distance $L_i$:

$$Z_i = \rho_i \times L_i$$
Step 7. Sort the decision values $Z_i$ of all data objects in D in descending order to obtain a sorted sequence; then, based on this sequence, plot the decision graph of D with the index i of data object $x_i$ on the horizontal axis and $Z_i$ on the vertical axis. The abscissa at the knee of this decision graph is the selected number of clusters k.
Step 8. Select k data objects from D to form the current set of cluster centers.
Step 9. Based on the current set of cluster centers, compute the dissimilarity $d(x_i, q_l)$ between each of the remaining n-k data objects $x_i$ and the k cluster centers $q_l$.
Step 10. According to the dissimilarities $d(x_i, q_l)$, assign the n-k data objects to their nearest clusters. After the assignment, k clusters are obtained and the n-k data objects are given cluster labels, yielding the clustering result for the current set of cluster centers.
Step 11. For each of the k clusters formed, take the most frequent value on each feature within the cluster to compose that cluster's new center, obtaining a new set of cluster centers.
Step 12. Repeat steps 9-11. When the cluster centers no longer change or the specified maximum number of iterations is reached, the algorithm terminates and the clustering result for the current set of cluster centers is output; otherwise, take the new set of cluster centers as the current set and return to step 9.
Iteration brings the selected cluster centers closer and closer to the true cluster centers, so the iterative process steadily improves the clustering. The termination condition can be chosen by the experimenter according to the actual situation: (1) stop when the maximum number of iterations is reached; (2) stop when the objective function reaches a threshold.
Here i, j = 1, 2, ..., n, where n is the number of data objects in D; s = 1, 2, ..., m, where m is the number of features of a data object; l = 1, 2, ..., k, where k is the number of clusters; $\delta(A_{i,s}, A_{q_l,s})$ is the dissimilarity between data object $x_i$ and cluster center $q_l$ on the s-th feature; $A_{i,s}$ is the s-th feature value of $x_i$; $A_{q_l,s}$ is the s-th feature value of $q_l$; the count of data objects in cluster $C_l$ whose feature value is $A_{s,t}$ is denoted $f_{s,t}$ (cf. formula (19) below); $|C_l|$ is the number of data objects in cluster $C_l$; and $\zeta_l$ is an adjustment coefficient.
In step 4 above, for a large-scale categorical data set D (data volume of 10 TB or more), the square-wave kernel method is used to compute the local neighborhood density $\rho_i$ of each data object $x_i$; for a small-scale categorical data set (below 10 TB), the Gaussian kernel method is used. Note: large-scale data generally refers to data volumes above 10 TB (1 TB = 1024 GB).
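A minimal sketch of steps 1-3 follows, assuming the per-object cut-off is the largest gap between adjacent sorted dissimilarities, exactly as step 2 states; the function names and the nested-list matrix representation are illustrative choices, not part of the patent:

```python
def hamming(x, y):
    # Simple Hamming dissimilarity: the number of features on which x and y differ.
    return sum(a != b for a, b in zip(x, y))

def cutoff_distance(D):
    # Steps 1-3: build the pairwise dissimilarity matrix, take each object's
    # cut-off d_{c,i} as the largest gap between adjacent sorted dissimilarities,
    # and take the global d_c as the minimum over all objects.
    n = len(D)
    d = [[hamming(D[i], D[j]) for j in range(n)] for i in range(n)]
    dc_per_object = []
    for i in range(n):
        s = sorted(d[i])
        dc_per_object.append(max(s[j + 1] - s[j] for j in range(n - 1)))
    return min(dc_per_object), d
```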
Compared with the prior art, the present invention has the following features:
1. The invention proposes a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity, which prevents the loss of important feature values during clustering, strengthens the similarity between feature values within a cluster, and weakens the similarity between feature values across clusters.
2. The proposed automatic cluster center selection method greatly reduces the error that random or manual selection of cluster centers introduces into clustering.
3. The proposed dissimilarity coefficient calculation method preserves the characteristics of the data and meets the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity; it improves clustering accuracy, purity, and recall, effectively improving the clustering of categorical data.
Description of the Drawings
Figure 1 is a schematic diagram of the sensitivity of the k-modes algorithm to the selection of initial cluster centers: (a) clustering result with k=1; (b) clustering result with k=2; (c) clustering result with k=3.
Figure 2 illustrates the case where the local neighborhood density of $x_i$ is not the maximum density.
Figure 3 illustrates the case where the local neighborhood density of $x_i$ is the maximum density.
Figure 4 illustrates the determination of the $d_{c,i}$ value.
Figure 5 is a schematic diagram of a two-dimensional data set.
Figure 6 is the decision graph.
Figure 7 is the $Z_i$ decision graph.
Figure 8 is the IKMCA flowchart.
Detailed Description
To make the objective, technical scheme, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific examples.
The symbols used and their meanings are listed in Table 1.
Table 1. Symbol descriptions
Taking data object $x_i$ and cluster center $q_l$ as an example, the simple Hamming distance of the classic k-modes algorithm is defined in formula (1); this computation gives every feature the same weight:

$$d(x_i, q_l) = \sum_{s=1}^{m} \delta(A_{i,s}, A_{q_l,s}) \qquad (1)$$

where $\delta(A_{i,s}, A_{q_l,s}) = 0$ if $A_{i,s} = A_{q_l,s}$, and 1 otherwise.
The k-modes algorithm minimizes an objective function built on the simple Hamming distance, shown in formula (2):

$$F(U, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il}\, d(x_i, q_l) \qquad (2)$$
Regarding the dissimilarity coefficient, the coefficient of the classic k-modes algorithm considers neither the relative frequency with which feature values occur within a cluster nor the intra-cluster and inter-cluster structure of each feature. As a result, during the assignment of new data objects, some clusters receive fewer similar objects. For ease of illustration, the artificial data set D1 shown in Table 2 is used to examine the dissimilarity coefficient. D1 is described by three features A = {A1, A2, A3}, where DOM(A1) = {A, B}, DOM(A2) = {E, F}, and DOM(A3) = {H, I}. D1 has two clusters, C1 and C2, with cluster centers q1 = (A, E, H) and q2 = (A, E, H) respectively.
Table 2. Artificial data set D1
Suppose x7 = (A, E, H) needs to be assigned to a cluster. Using the simple Hamming distance, d(x7, q1) = d(x7, q2) = 0 + 0 + 0 = 0. In terms of intra-cluster similarity, however, x7 should be assigned to cluster C1.
Regarding initial cluster center selection, the classic k-modes algorithm is very sensitive to the initial centers, which are chosen either by random initialization or by manual setting; both approaches make the clustering results unstable to some extent. Initial centers at different positions, and different values of k, produce different clustering results. As shown in Figure 1, the true number of clusters of the data set is 3. Choosing different initial centers and setting different values of k may produce different clusterings; from left to right, Figure 1 shows the randomly selected initial centers, the clustering iteration process, and the final clustering result. Finding suitable initial cluster centers is therefore very important.
If the chosen dissimilarity coefficient can discover all or some of the potential modes in the data set, the partition-based k-modes algorithm benefits greatly; for k-modes to produce good clustering results, the dissimilarity between data objects within a cluster should be minimal and the dissimilarity between data objects in different clusters maximal. The present invention therefore proposes a new dissimilarity coefficient, the "intra-cluster and inter-cluster dissimilarity coefficient", based on intra-cluster and inter-cluster similarity. The k-modes algorithm based on intra-cluster and inter-cluster dissimilarity (IKMCA) determines the initial cluster centers with an improved density peaks algorithm, computes the dissimilarity between each data object and the cluster centers with the intra-cluster and inter-cluster dissimilarity coefficient, and updates the cluster centers accordingly.
The intra-cluster dissimilarity considers the relative frequency with which feature values are distributed within the same cluster. For data objects belonging to the same cluster, shared feature values occur with higher frequency and intra-cluster similarity is higher. The intra-cluster dissimilarity is defined in formula (3):

$$d(x_i, q_l) = \sum_{s=1}^{m} \left(1 - \frac{f(A_{i,s}, C_l)}{|C_l|}\right) \qquad (3)$$

where $1 \le i \le n$, $1 \le s \le m$, and $f(A_{i,s}, C_l)$ denotes the number of data objects in cluster $C_l$ whose s-th feature value equals $A_{i,s}$.
Using data set D1 and formula (3), d(x7, q1) = (1 - 2/3) + (1 - 2/3) + (1 - 1) = 2/3 and d(x7, q2) = (1 - 2/3) + (1 - 2/3) + (1 - 2/3) = 1. From these results, x7 has the minimum dissimilarity with cluster C1, so x7 should be assigned to C1. Although formula (3) accounts for the relative frequency of feature values within a cluster, it ignores the distribution of feature values across clusters. The artificial data set D2 shown in Table 3 is used to discuss the defect of ignoring inter-cluster similarity. D2 is described by three categorical features A = {A1, A2, A3}, where DOM(A1) = {A, B, C}, DOM(A2) = {E, F}, and DOM(A3) = {H, I, J}. D2 has three clusters C1, C2, and C3, with cluster centers q1 = (A, E, H), q2 = (A, E, H), and q3 = (B, E, I) respectively.
Table 3. Artificial data set D2
Suppose x10 = (A, E, H) needs to be assigned to a cluster. Using the simple Hamming distance gives d(x10, q1) = d(x10, q2) = d(x10, q3) = 0 + 0 + 0 = 0. Using formula (3) gives d(x10, q1) = (1 - 2/3) + (1 - 2/3) + (1 - 3/3) = 2/3, d(x10, q2) = (1 - 2/3) + (1 - 3/3) + (1 - 2/3) = 2/3, and d(x10, q3) = 1 + 0 + 1 = 2. Hence the simple Hamming distance cannot decide the assignment of x10, and formula (3) can assign x10 to either cluster C1 or cluster C2; that is, formula (3) cannot determine the correct cluster for x10. Viewing data set D2 from the standpoint of "low intra-cluster dissimilarity, high inter-cluster dissimilarity", assigning x10 to cluster C1 is more appropriate, because doing so maximizes the dissimilarity between clusters C1 and C2.
The inter-cluster dissimilarity considers the total frequency of a feature value across all clusters. If a feature value occurs frequently in only one cluster, that value differs strongly from the other clusters. The intra-cluster and inter-cluster dissimilarity coefficient is defined in formula (4):

$$d(x_i, q_l) = \sum_{s=1}^{m} \left(1 - \frac{f(A_{i,s}, C_l)}{|C_l|} \cdot \frac{f(A_{i,s}, C_l)}{\sum_{r=1}^{k} f(A_{i,s}, C_r)}\right) \qquad (4)$$

where $1 \le i \le n$ and $1 \le s \le m$.
Computing on data set D2 with formula (4) gives d(x10, q1) = (1 - 2/3 × 2/4) + (1 - 2/3 × 2/8) + (1 - 3/3 × 3/5) = 1.9; d(x10, q2) = (1 - 2/3 × 2/4) + (1 - 3/3 × 3/8) + (1 - 2/3 × 2/5) = 2.025; d(x10, q3) = (1 - 0 × 1) + (1 - 3/3 × 3/8) + (1 - 0 × 1) = 2.625. By formula (4), the dissimilarity between x10 and cluster C1 is the smallest; this agrees with the earlier analysis, so x10 is clustered correctly. Formula (4) is next checked on the more particular artificial data set D3. As shown in Table 4, D3 is described by three features A = {A1, A2, A3}, where DOM(A1) = {A, B}, DOM(A2) = {E, F}, and DOM(A3) = {H, I}. Its three clusters C1, C2, and C3 have cluster centers q1 = (A, E, H), q2 = (A, E, H), and q3 = (A, E, H); the values A, E, and H are uniformly distributed in D3, each occurring 6 times.
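The arithmetic above pins down formulas (3) and (4) even though the source renders them as images. A minimal sketch follows, with the frequency counts read off the worked example for D2 (counts of x10's values (A, E, H) in each cluster, cluster sizes of 3, and cross-cluster totals A:4, E:8, H:5); the function names are illustrative:

```python
def intra_dissim(counts, size):
    # Formula (3): sum over features of 1 - (within-cluster frequency / cluster size).
    return sum(1 - c / size for c in counts)

def intra_inter_dissim(counts, size, totals):
    # Formula (4): the within-cluster relative frequency is further scaled by the
    # share of the value's total occurrences that fall in this cluster.
    return sum(1 - (c / size) * (c / t if t else 0) for c, t in zip(counts, totals))

totals = [4, 8, 5]
for name, counts in [("C1", [2, 2, 3]), ("C2", [2, 3, 2]), ("C3", [0, 3, 0])]:
    print(name, round(intra_inter_dissim(counts, 3, totals), 3))
# -> C1 1.9, C2 2.025, C3 2.625, matching the worked example in the text.
```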
Table 4. Artificial data set D3
x10 = (A, E, H) is assigned using the simple Hamming distance, formula (3), and formula (4) in turn. The simple Hamming distance gives d(x10, q1) = d(x10, q2) = d(x10, q3) = 0 + 0 + 0 = 0; formula (3) gives d(x10, q1) = d(x10, q2) = d(x10, q3) = (1 - 2/3) + (1 - 2/3) + (1 - 2/3) = 1; formula (4) gives d(x10, q1) = d(x10, q2) = d(x10, q3) = (1 - 2/3 × 2/6) + (1 - 2/3 × 2/6) + (1 - 2/3 × 2/6) = 21/9. When feature values are uniformly distributed, none of these three dissimilarity coefficients can cluster x10 correctly, so the intra-cluster and inter-cluster dissimilarity coefficient is refined once more.
The feature-value distribution of the data object $x_i$ is compared with the overall feature-value distribution of its cluster; the refinement term is defined in formula (5), where $x_i$ is the data object to be assigned and $x_j$ ranges over the data objects in cluster $C_l$. The redefined intra-cluster and inter-cluster dissimilarity coefficient is given in formula (6).
For any $x_i, x_j \in D$, d has the following properties:
Self-distance: for all $x_i$, the distance of each object to itself is zero, $d(x_i, x_i) = 0$.
Symmetry: for all $x_i$ and $x_j$, the distance from $x_i$ to $x_j$ equals the distance from $x_j$ to $x_i$, $d(x_i, x_j) = d(x_j, x_i)$.
Non-negativity: for all $x_i, x_j$, the distance d is non-negative, and $d(x_i, x_j) = 0$ if and only if $x_i = x_j$.
Triangle inequality: for all $x_i$ and $x_j$, $d(x_i, x_j) \le d(x_i, x_h) + d(x_h, x_j)$.
Computing on D3 again with formula (6), the new dissimilarity coefficients are d(x10, q1) = 21/9 + ζ1 = 2.6666, d(x10, q2) = 21/9 + ζ2 = 2.8148, and d(x10, q3) = 21/9 + ζ3 = 2.8888. From these results, x10 should be assigned to cluster C1. The assignment matches the actual situation, showing that the proposed dissimilarity coefficient scheme is feasible.
In 2014, Rodriguez et al. proposed the density peaks (DP) algorithm. DP is a clustering algorithm based on relative distance and local neighborhood density; it handles numerical data, and its input is the dissimilarity matrix between data objects. By computing the dissimilarity between categorical data with a suitable dissimilarity coefficient, the DP algorithm can therefore be applied to categorical data clustering. This section exploits DP's ability to determine the number of clusters automatically in order to determine the initial cluster centers.
The local neighborhood density $\rho_i$ of data object $x_i$ equals the number of data objects within the region centered at $x_i$ with radius equal to the cut-off distance $d_c$. It can be defined in two ways: the square-wave kernel method and the Gaussian kernel method.
The square-wave kernel method suits large-scale data sets; its definition of $\rho_i$ is given in formula (7):

$$\rho_i = \sum_{j \ne i} \chi(d_{i,j} - d_c) \qquad (7)$$

where $\chi(x) = 1$ when $d_{i,j} - d_c \le 0$, and $\chi(x) = 0$ otherwise.
If the data set contains few objects, computing $\rho_i$ with the square-wave kernel is prone to statistical error and can make the clustering inaccurate; the Gaussian kernel method can then be used instead. The Gaussian kernel is a common density estimator widely applied in density-based clustering analysis. Its definition of the local neighborhood density $\rho_i$ is given in formula (8):

$$\rho_i = \sum_{j \ne i} K\!\left(\frac{d_{i,j}}{d_c}\right) \qquad (8)$$

where $K(x) = \exp\{-x^2\}$.
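A sketch of the two density estimators, assuming d is the pairwise dissimilarity matrix from step 1, $d_c > 0$, and that the sums exclude the object itself; the function names are illustrative:

```python
import math

def density_square_wave(d, i, dc):
    # Formula (7): count the neighbours whose dissimilarity to x_i is within d_c.
    return sum(1 for j, dij in enumerate(d[i]) if j != i and dij - dc <= 0)

def density_gaussian(d, i, dc):
    # Formula (8): smooth density using the Gaussian kernel K(x) = exp(-x^2).
    return sum(math.exp(-(dij / dc) ** 2) for j, dij in enumerate(d[i]) if j != i)
```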
As formulas (7) and (8) show, the value of $d_c$ directly affects the magnitude of $\rho_i$, and hence the selection of cluster centers and the whole clustering result. Determining a suitable $d_c$ is therefore important for the algorithm.
Following the DP algorithm of Rodriguez et al., the relative distance $L_i$ between data objects $x_i$ and $x_j$ is defined as in formula (9):
When $\rho_i$ is not the maximum density, $L_i$ is defined as the distance between $x_i$ and its nearest data object among all data objects with higher local neighborhood density, as in formula (10) and Figure 2:

$$L_i = \min_{j:\, \rho_j > \rho_i} d_{i,j} \qquad (10)$$
When $\rho_i$ is the maximum density, $L_i$ is defined as the distance between $x_i$ and the data object farthest from it, as in formula (11) and Figure 3. A data object with both high $L_i$ and high $\rho_i$ is a cluster center.

$$L_i = \max_{j} d_{i,j} \qquad (11)$$
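A sketch of formulas (10)-(11), assuming density ties count as "not higher" so that the unique density maximum falls through to the farthest-object case:

```python
def relative_distance(d, rho):
    # Formula (10): distance to the nearest object of strictly higher density;
    # formula (11): for the density maximum, distance to the farthest object.
    n = len(rho)
    L = [0.0] * n
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        L[i] = min(higher) if higher else max(d[i][j] for j in range(n) if j != i)
    return L
```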
The cut-off distance $d_c$ is a critical value that limits the distance search range. In the DP algorithm, $d_c$ must be set manually: the pairwise distances in the data set are sorted in ascending order, and the value at the 1% to 2% position is taken as $d_c$, which gives only a rough range. In practical clustering problems, setting $d_c$ too large makes the computed $\rho_i$ values overlap, while setting it too small makes the cluster distribution sparse. The present invention gives a concrete method for determining $d_c$. Let $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,n}]$ be the dissimilarities between $x_i$ and the other data objects, computed with formula (1), and let $d'_i = [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,n}]$ be $d_i$ sorted in ascending order. The cut-off distance $d_{c,i}$ of $x_i$ is defined in formula (12):

$$d_{c,i} = \max_{j}\,(d'_{i,j+1} - d'_{i,j}) \qquad (12)$$

where $\max(d'_{i,j+1} - d'_{i,j})$ is the maximum difference between adjacent dissimilarities in $d'_i$.
Let $d'_{i,j} = d_a$ and $d'_{i,j+1} = d_b$. As shown in Figure 4, the data object $x_i$ has small dissimilarity to data objects in its own cluster and large dissimilarity to data objects in other clusters. Therefore, within $d'_i = [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,j}, d'_{i,j+1}, \ldots, d'_{i,n}]$ there must be a critical position where the difference between $d'_{i,j+1}$ and $d'_{i,j}$ is largest; data object $x_i$ and data object a are then considered to belong to the same cluster, while data object b belongs to a different cluster. The $d_{c,i}$ value of data object $x_i$ is computed as in formula (13):
The value of $d_c$ is defined as the minimum of the set $\{d_{c,i}\}$, as in formula (14):

$$d_c = \min_i\, d_{c,i} \qquad (14)$$
IKMCA determines the initial cluster centers under two assumptions: (1) the local neighborhood density of a cluster center is higher than that of the surrounding non-center points; (2) the relative distances between cluster centers are large. Based on these assumptions, this section gives a method for determining the initial cluster centers automatically. Figure 5 shows a two-dimensional example data set with 93 data objects and 2 clusters, corresponding to 2 cluster centers.
In the DP algorithm, cluster centers are selected from the decision graph. As shown in Figure 6, the horizontal axis of the decision graph is the local neighborhood density $\rho_i$ of data object $x_i$ and the vertical axis is the relative distance $L_i$. Points whose $\rho_i$ and $L_i$ are both large are the cluster centers of the data set; the two points in the upper-right corner of Figure 6 are the centers of its two clusters. A cluster center is surrounded by many data objects, so both its local neighborhood density $\rho_i$ and its relative distance $L_i$ are large.
To observe and determine the cluster centers more intuitively, the $Z_i$ decision graph is used to select them. The local neighborhood density and relative distance of each data object are obtained from formulas (8) and (9). The $Z_i$ values of all data objects are computed by $Z_i = \rho_i \times L_i$ and sorted in descending order into the sequence $Z_{(1)} > Z_{(2)} > \ldots > Z_{(n)}$, where the points corresponding to $Z_{(1)} > Z_{(2)} > \ldots > Z_{(k)}$ (k < n) are the cluster centers. As shown in Figure 7, the horizontal axis of the $Z_i$ decision graph is the index of data object $x_i$ and the vertical axis is $Z_i$; the larger $Z_i$, the more likely the object is a cluster center. Passing from cluster centers to non-centers, the $Z_i$ graph exhibits a very clear knee: the abscissa of the knee is the number of clusters k, the data objects left of the knee are the cluster centers, and those right of it are non-centers. In Figure 7, the two points at the upper left are cluster centers, and the very smoothly distributed points at the lower right are all non-centers.
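A programmatic stand-in for reading the knee off the $Z_i$ graph follows; picking the largest drop between consecutive sorted values is an assumption, since the patent locates the knee visually:

```python
def choose_k(rho, L):
    # Z_i = rho_i * L_i, sorted in descending order; the sharpest drop between
    # consecutive values marks the knee, and the points before it are centers.
    Z = sorted((r * l for r, l in zip(rho, L)), reverse=True)
    gaps = [Z[j] - Z[j + 1] for j in range(len(Z) - 1)]
    return gaps.index(max(gaps)) + 1
```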
The new intra-cluster and inter-cluster dissimilarity coefficient makes the dissimilarity computation for categorical data more accurate. Automatic selection of the initial cluster centers avoids the uncertainty that random selection or manual setting brings to the classic k-modes algorithm. The DP algorithm gives only a rough range for $d_c$; building on DP, the present invention gives an explicit method for determining $d_c$. Applying the intra-cluster and inter-cluster dissimilarity coefficient to the classic k-modes algorithm, the objective function is defined in formula (15). Theorem 1 shows how to minimize the objective function F(U, Q).
$$u_{il} \in \{0, 1\},\quad 1 \le i \le n,\ 1 \le l \le k \qquad (16)$$

$$\sum_{l=1}^{k} u_{il} = 1,\quad 1 \le i \le n \qquad (17)$$

$$0 < \sum_{i=1}^{n} u_{il} < n,\quad 1 \le l \le k \qquad (18)$$
$U_{n \times k}$ is the membership matrix satisfying constraints (16)-(18); $u_{il} = 1$ indicates that $x_i$ belongs to cluster $C_l$. When constraints (16)-(18) hold and the objective function F(U, Q) reaches its minimum, the clustering algorithm can be judged to have finished.
Theorem 1: The cluster centers of IKMCA should be selected so that the function F(U, Q) is minimized, if and only if $t_s \ne t_h$ ($1 \le s \le m$). In words, each feature value of a cluster center should be the value occurring with the highest frequency on that feature.
$f_{s,t}(x_i)$ ($1 \le s \le m$, $1 \le t \le n_s$) denotes the number of data objects whose s-th feature takes the value $A_{s,t}$, as in formula (19):

$$f_{s,t}(x_i) = |\{x_i \in D,\ x_{i,s} = A_{s,t}\}| \qquad (19)$$
The function F(U, Q) is minimized when and only when the values of $q_l$ are taken from $\mathrm{DOM}(A_s)$, $1 \le s \le m$, and formula (20), the most-frequent-value condition, is satisfied:
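A sketch of the center update that Theorem 1 prescribes, taking the per-feature mode within each cluster; the function name and tuple representation are illustrative:

```python
from collections import Counter

def update_center(cluster):
    # Theorem 1 / formula (20): the new center takes, on each feature,
    # the value occurring most frequently within the cluster.
    m = len(cluster[0])
    return tuple(
        Counter(obj[s] for obj in cluster).most_common(1)[0][0] for s in range(m)
    )

# Example: the mode of each column.
print(update_center([("A", "E", "H"), ("A", "F", "H"), ("B", "E", "H")]))
# -> ('A', 'E', 'H')
```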
To drive the objective function F(U, Q) to a minimum, the improved k-modes algorithm based on intra-cluster and inter-cluster dissimilarity (IKMCA), i.e. the categorical data clustering method of the present invention, is described as follows (see Figure 8):
Input: a categorical data set D with n data objects and m categorical features. Output: the final cluster set C = {C1, C2, ..., Ck}.
Step 1. Compute the dissimilarities $d_{i,j}$ with formula (1) and obtain the dissimilarity matrix $d_{n \times n}$.
Step 2. Compute the cut-off distance $d_c$ according to formula (14).
Step 3. Compute the local neighborhood density $\rho_i$ using formula (7) or formula (8).
Step 4. Compute the relative distance $L_i$ using formula (9).
Step 5. Compute $Z_i = \rho_i \times L_i$ to obtain $Z = \{Z_1, Z_2, \ldots, Z_n\}$.
Step 6. Sort the $Z_i$ in descending order into the sequence $Z_{(1)} > Z_{(2)} > \ldots > Z_{(n)}$. Plot the $Z_i$ decision graph with the index of data object $x_i$ on the horizontal axis and $Z_i$ on the vertical axis, and locate the knee; the abscissa of the knee is the best value of k.
Step 7. Determine k and the initial cluster center set $q^{(0)} = \{q_1, q_2, \ldots, q_k\}$.
Step 8. Compute the dissimilarity $d(x_i, q_l)$ between the n-k remaining data objects and the k initial cluster centers according to formula (6).
Step 9. Assign each data object to its nearest initial cluster. After the assignment, k clusters $C^{(1)} = \{C_1, C_2, \ldots, C_k\}$ are obtained, and the n-k data objects are given cluster labels.
Step 10. On the newly formed clusters, update the cluster centers $q^{(1)} = \{q_1, q_2, \ldots, q_k\}$ according to Theorem 1.
Step 11. Repeat steps 8-10 until the objective function value no longer changes; when it no longer changes, the algorithm ends, otherwise continue from step 8.
Step 12. The algorithm ends and the clustering is complete.
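A compact sketch of steps 8-12 follows, assuming a pluggable dissimilarity function; the simple Hamming distance is used below only as an illustrative stand-in for formula (6), whose ζ term (formula (5)) is not fully reproduced in the source:

```python
from collections import Counter

def ikmca_iterate(D, centers, dissim, max_iter=100):
    # Steps 8-12: nearest-center assignment under the given dissimilarity,
    # followed by a per-feature mode update, repeated until centers stabilise.
    k, m = len(centers), len(D[0])
    labels = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        labels = []
        for x in D:
            l = min(range(k), key=lambda c: dissim(x, centers[c]))
            clusters[l].append(x)
            labels.append(l)
        new_centers = [
            tuple(Counter(obj[s] for obj in cl).most_common(1)[0][0] for s in range(m))
            if cl else centers[l]  # keep the old center if a cluster empties
            for l, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Illustrative run with Hamming distance standing in for formula (6).
hamming = lambda x, q: sum(a != b for a, b in zip(x, q))
data = [("A", "E", "H"), ("A", "E", "I"), ("B", "F", "I"), ("B", "F", "J")]
print(ikmca_iterate(data, [("A", "E", "H"), ("B", "F", "J")], hamming))
```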
Let l be the number of iterations required for the algorithm to converge; usually n >> m, k, l. The time cost of IKMCA lies mainly in updating the cluster centers and the dissimilarities in each iteration. Initializing the cluster centers requires a human to read the decision graph, so the time complexity of that stage is not counted in the overall algorithm. Updating the cluster centers and computing the dissimilarities with the intra-cluster and inter-cluster dissimilarity coefficient in each iteration costs l(O(nmk) + O(nmk)) = O(nmkl), so the total time complexity of IKMCA is O(nmkl). From this analysis, the time complexity of the IKMCA algorithm with intra-cluster and inter-cluster dissimilarity is linearly scalable in the number of data objects, the number of clusters, and the number of features.
To evaluate the effectiveness of the proposed algorithm, the clustering results are assessed with three indicators: clustering accuracy (AC), purity (PR), and recall (RE), given in formulas (21)-(23). NUM+ denotes the number of data objects correctly assigned to cluster $C_l$; NUM- denotes the number of data objects incorrectly assigned to cluster $C_l$; NUM* denotes the number of data objects that should have been assigned to cluster $C_l$ but were not. The closer the clustering result is to the true partition of the data set, the larger the values of AC, PR, and RE, and the more effective the algorithm.
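Formulas (21)-(23) are rendered as images in the source; the sketch below assumes the standard forms implied by the NUM+/NUM-/NUM* definitions, namely AC = Σ NUM+ / n with PR and RE averaged per cluster. The cluster-to-class mapping argument is also an assumption:

```python
def evaluate(labels_true, labels_pred, k, mapping):
    # AC, PR, RE built from the NUM+/NUM-/NUM* counts described in the text.
    # mapping[l] is the true class matched to predicted cluster l.
    n = len(labels_true)
    ac = pr = re = 0.0
    for l in range(k):
        plus = sum(1 for t, p in zip(labels_true, labels_pred) if p == l and t == mapping[l])
        minus = sum(1 for t, p in zip(labels_true, labels_pred) if p == l and t != mapping[l])
        star = sum(1 for t, p in zip(labels_true, labels_pred) if p != l and t == mapping[l])
        ac += plus
        pr += plus / (plus + minus) if plus + minus else 0.0
        re += plus / (plus + star) if plus + star else 0.0
    return ac / n, pr / k, re / k
```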
The algorithm is implemented in Python, and all experiments were run on an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz under Windows 10. The data sets used are real data sets. UCI is a repository of real data sets for machine learning provided by the University of California, Irvine. To test the effectiveness of the algorithm, the Mushroom (Mus), Breast-cancer (Bre), Car, and Soybean-small (Soy) data sets were selected from the UCI repository for experimental verification. Table 5 lists the details of these data sets.
Table 5. Data set descriptions
The IKMCA algorithm proposed here, the k-modes algorithm of Huang, the IDMKCA algorithm of Ng et al., and the EKACMD algorithm of Ravi et al. were each run 30 times and the results averaged. The AC, PR, and RE results are shown in Tables 6-9.
Table 6. Experimental results of the four algorithms on the Mus data set
Table 7. Experimental results of the four algorithms on the Bre data set
Table 8. Experimental results of the four algorithms on the Car data set
Table 9. Experimental results of the four algorithms on the Soy data set
The experimental results show that, on the Mus, Bre, Car, and Soy data sets, IKMCA outperforms the k-modes, IDMKCA, and EKACMD algorithms on AC, PR, and RE in most cases. IKMCA outperforms the classic k-modes algorithm because the preprocessing of k-modes destroys the original structure of the categorical features: computing the dissimilarity coefficient of the converted categorical values with the simple Hamming distance does not reveal the dissimilarity between categorical data. When a data set has very many features, a simple 0-1 comparison may yield a dissimilarity that is far too large or far too small. Compared with the classic k-modes algorithm and with IDMKCA and EKACMD, the proposed intra-cluster and inter-cluster dissimilarity reveals the structure of the data set better.
The classic k-modes algorithm computes dissimilarity with the simple Hamming distance, which weakens intra-class similarity and ignores inter-cluster similarity. To address these problems, the present invention proposes a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity. The method prevents the loss of important feature values during clustering, strengthens the similarity between feature values within a cluster, and weakens the similarity between feature values across clusters. The proposed automatic cluster center selection method greatly reduces the error that random or manual center selection introduces into clustering. Illustrative examples have shown the limitations of the simple Hamming distance and several other dissimilarity coefficients used with k-modes, and a new dissimilarity coefficient has been proposed. The categorical data clustering algorithm with the improved dissimilarity coefficient was compared experimentally on UCI data sets against k-modes algorithms based on other dissimilarity coefficients. The results show that the proposed dissimilarity coefficient calculation preserves the characteristics of the data, meets the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity, improves clustering accuracy, purity, and recall, and effectively improves the clustering of categorical data.
The proposed algorithm (IKMCA) can be applied to rehabilitation treatment recommendation systems containing massive purely categorical data. In such a system, categorical data clustering can help analysts distinguish different patient groups in the patient database, discover deeper information hidden in its distribution, and formulate personalized rehabilitation plans.
It should be noted that although the embodiments described above are illustrative, they do not limit the invention, which is therefore not restricted to the specific implementations above. Any other implementation obtained by those skilled in the art under the teaching of the present invention without departing from its principles is deemed to fall within its protection.
Claims (2)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011009696.6A | 2020-09-23 | 2020-09-23 | A Categorical Data Clustering Method Based on Intra-cluster Dissimilarity |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112132217A | 2020-12-25 |
| CN112132217B | 2023-08-15 |
Family
ID=73841250
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011009696.6A (CN112132217B, active) | A Categorical Data Clustering Method Based on Intra-cluster Dissimilarity | 2020-09-23 | 2020-09-23 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN112132217B (en) |
Families Citing this family (2)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN114970649B * | 2021-02-23 | 2024-07-26 | 广东精点数据科技股份有限公司 | Network information processing method based on clustering algorithm |
| CN117853152B * | 2024-03-07 | 2024-05-17 | 云南疆恒科技有限公司 | Business marketing data processing system based on multiple channels |
Citations (4)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| US6049797A * | 1998-04-07 | 2000-04-11 | Lucent Technologies, Inc. | Method, apparatus and programmed medium for clustering databases with categorical attributes |
| CN107122793A * | 2017-03-23 | 2017-09-01 | 北京航空航天大学 | An improved global-optimization k-modes clustering method |
| CN107358368A * | 2017-07-21 | 2017-11-17 | 国网四川省电力公司眉山供电公司 | A robust k-means clustering method for power consumer segmentation |
| CN108510010A * | 2018-04-17 | 2018-09-07 | 中国矿业大学 | A density peaks clustering method and system based on pre-screening |
Family Cites Families (1)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| US20170371886A1 * | 2016-06-22 | 2017-12-28 | Agency For Science, Technology And Research | Methods for identifying clusters in a dataset, methods of analyzing cytometry data with the aid of a computer and methods of detecting cell sub-populations in a plurality of cells |
Non-Patent Citations (1)
| Title |
|---|
| Jia Ziqi; Song Ling. A k-prototypes clustering algorithm for mixed-type data clustering. Journal of Chinese Computer Systems, No. 09, pp. 1845-1852. * |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |