CN112132217B - A Categorical Data Clustering Method Based on Intra-cluster Dissimilarity - Google Patents
- Publication number
- CN112132217B (granted from application CN202011009696.6A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- data
- dissimilarity
- data object
- clusters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/2321—Pattern recognition; Analysing; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/22—Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- Y02D10/00—Climate change mitigation technologies in ICT; Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The invention relates to the technical field of data clustering, and in particular to a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity.
Background
A clustering algorithm is a machine-learning algorithm that groups data. Given a data set, a clustering algorithm divides it into groups such that, in principle, objects in the same group share similar attributes or features while objects in different groups differ substantially. Clustering is an unsupervised learning technique and, as a widely used data-analysis tool, has been applied in many fields.
In data science, clustering algorithms are used for cluster analysis; grouping the data makes the information it contains easier to extract. As an important part of data mining, categorical data clustering can help analysts summarize the disease characteristics of each class of patients from the database of a rehabilitation treatment recommendation system, so that they can focus on a specific patient group for further analysis.
The classic k-means algorithm uses the Euclidean distance to compute cluster means and the dissimilarity between data objects, so it applies only to numerical data sets with continuous features; for categorical data sets with discrete features, k-means is no longer applicable. In 1998, Huang extended k-means by replacing "means" with "modes" and proposed the k-modes algorithm for categorical data clustering. k-modes computes dissimilarity with the simple Hamming distance, which ignores differences within the same categorical feature across data objects, weakens intra-cluster similarity, and does not fully reflect the dissimilarity between two values of the same categorical feature, degrading the accuracy of the clustering results. In addition, k-modes determines the initial cluster centers and the value of k by random selection and recomputes and updates cluster centers with a frequency-based method, which introduces considerable uncertainty into the clustering results.
Summary of the Invention
The present invention addresses the accuracy of the k-modes algorithm and the problem of initial cluster center selection by providing a categorical data clustering method based on intra-cluster and inter-cluster dissimilarity.
To solve the above problems, the present invention is realized through the following technical scheme:
A categorical data clustering method based on intra-cluster and inter-cluster dissimilarity comprises the following steps:
Step 1. For a categorical data set D with n data objects, compute the dissimilarity $d_{i,j}$ between every pair of data objects using the simple Hamming distance.
Step 2. For each data object $x_i$ of D, sort the dissimilarities $d_{i,j}$ between $x_i$ and the other data objects in ascending order to obtain the dissimilarity vector $d'_i = [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,n}]$; then take the maximum difference between two adjacent dissimilarities in $d'_i$ as the cut-off distance $d_{c,i}$ of $x_i$.
Step 3. Take the minimum of the cut-off distances $d_{c,i}$ over all data objects in D as the cut-off distance $d_c$ of D.
Step 4. Based on the cut-off distance $d_c$ of D, compute the local neighborhood density $\rho_i$ of each data object $x_i$ using either the square-wave kernel method or the Gaussian kernel method.
Step 5. Compute the relative distance $L_i$ of each data object $x_i$ in D.
Step 6. For each data object $x_i$ of D, obtain its decision value $Z_i$ from its local neighborhood density $\rho_i$ and relative distance $L_i$:

$$Z_i = \rho_i \times L_i$$
Step 7. Sort the decision values $Z_i$ of all data objects in D in descending order to obtain a sorted sequence; then, based on this sequence, plot the decision graph of D with the index i of data object $x_i$ on the horizontal axis and $Z_i$ on the vertical axis. The abscissa at the knee of this decision graph is the selected number of clusters k.
Step 8. Select k data objects from D to form the current set of cluster centers.
Step 9. Based on the current set of cluster centers, compute the dissimilarity $d(x_i, q_l)$ between each of the remaining n-k data objects $x_i$ and the k cluster centers $q_l$.
Step 10. According to the dissimilarities $d(x_i, q_l)$, assign the n-k data objects to their nearest clusters. After the assignment, k clusters are obtained and the n-k data objects are given cluster labels, yielding the clustering result for the current set of cluster centers.
Step 11. For each of the k clusters formed, take the most frequent value on each feature within the cluster to compose that cluster's new center, obtaining a new set of cluster centers.
Step 12. Repeat steps 9-11. When the cluster centers no longer change or the specified maximum number of iterations is reached, the algorithm terminates and the clustering result for the current set of cluster centers is output; otherwise, take the new set of cluster centers as the current set and return to step 9.
Iteration brings the selected cluster centers closer and closer to the true cluster centers, so the iterative process steadily improves the clustering. The termination condition can be chosen by the experimenter according to the actual situation: (1) stop when the maximum number of iterations is reached; (2) stop when the objective function reaches a threshold.
Here i, j = 1, 2, ..., n, where n is the number of data objects in D; s = 1, 2, ..., m, where m is the number of features of a data object; l = 1, 2, ..., k, where k is the number of clusters; $\delta(A_{i,s}, A_{q_l,s})$ is the dissimilarity between data object $x_i$ and cluster center $q_l$ on the s-th feature; $A_{i,s}$ is the s-th feature value of $x_i$; $A_{q_l,s}$ is the s-th feature value of $q_l$; the count of data objects in cluster $C_l$ whose feature value is $A_{s,t}$ is denoted $f_{s,t}$ (cf. formula (19) below); $|C_l|$ is the number of data objects in cluster $C_l$; and $\zeta_l$ is an adjustment coefficient.
In step 4 above, for a large-scale categorical data set D (data volume of 10 TB or more), the square-wave kernel method is used to compute the local neighborhood density $\rho_i$ of each data object $x_i$; for a small-scale categorical data set (below 10 TB), the Gaussian kernel method is used. Note: large-scale data generally refers to data volumes above 10 TB (1 TB = 1024 GB).
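A minimal sketch of steps 1-3 follows, assuming the per-object cut-off is the largest gap between adjacent sorted dissimilarities, exactly as step 2 states; the function names and the nested-list matrix representation are illustrative choices, not part of the patent:

```python
def hamming(x, y):
    # Simple Hamming dissimilarity: the number of features on which x and y differ.
    return sum(a != b for a, b in zip(x, y))

def cutoff_distance(D):
    # Steps 1-3: build the pairwise dissimilarity matrix, take each object's
    # cut-off d_{c,i} as the largest gap between adjacent sorted dissimilarities,
    # and take the global d_c as the minimum over all objects.
    n = len(D)
    d = [[hamming(D[i], D[j]) for j in range(n)] for i in range(n)]
    dc_per_object = []
    for i in range(n):
        s = sorted(d[i])
        dc_per_object.append(max(s[j + 1] - s[j] for j in range(n - 1)))
    return min(dc_per_object), d
```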
Compared with the prior art, the present invention has the following features:
1. The invention proposes a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity, which prevents the loss of important feature values during clustering, strengthens the similarity between feature values within a cluster, and weakens the similarity between feature values across clusters.
2. The proposed automatic cluster center selection method greatly reduces the error that random or manual selection of cluster centers introduces into clustering.
3. The proposed dissimilarity coefficient calculation method preserves the characteristics of the data and meets the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity; it improves clustering accuracy, purity, and recall, effectively improving the clustering of categorical data.
Description of the Drawings
Figure 1 is a schematic diagram of the sensitivity of the k-modes algorithm to the selection of initial cluster centers: (a) clustering result with k=1; (b) clustering result with k=2; (c) clustering result with k=3.
Figure 2 illustrates the case where the local neighborhood density of $x_i$ is not the maximum density.
Figure 3 illustrates the case where the local neighborhood density of $x_i$ is the maximum density.
Figure 4 illustrates the determination of the $d_{c,i}$ value.
Figure 5 is a schematic diagram of a two-dimensional data set.
Figure 6 is the decision graph.
Figure 7 is the $Z_i$ decision graph.
Figure 8 is the IKMCA flowchart.
Detailed Description
To make the objective, technical scheme, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific examples.
The symbols used and their meanings are listed in Table 1.
Table 1. Symbol descriptions
Taking data object $x_i$ and cluster center $q_l$ as an example, the simple Hamming distance of the classic k-modes algorithm is defined in formula (1); this computation gives every feature the same weight:

$$d(x_i, q_l) = \sum_{s=1}^{m} \delta(A_{i,s}, A_{q_l,s}) \qquad (1)$$

where $\delta(A_{i,s}, A_{q_l,s}) = 0$ if $A_{i,s} = A_{q_l,s}$, and 1 otherwise.
The k-modes algorithm minimizes an objective function built on the simple Hamming distance, shown in formula (2):

$$F(U, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il}\, d(x_i, q_l) \qquad (2)$$
Regarding the dissimilarity coefficient, the coefficient of the classic k-modes algorithm considers neither the relative frequency with which feature values occur within a cluster nor the intra-cluster and inter-cluster structure of each feature. As a result, during the assignment of new data objects, some clusters receive fewer similar objects. For ease of illustration, the artificial data set D1 shown in Table 2 is used to examine the dissimilarity coefficient. D1 is described by three features A = {A1, A2, A3}, where DOM(A1) = {A, B}, DOM(A2) = {E, F}, and DOM(A3) = {H, I}. D1 has two clusters, C1 and C2, with cluster centers q1 = (A, E, H) and q2 = (A, E, H) respectively.
Table 2. Artificial data set D1
Suppose x7 = (A, E, H) needs to be assigned to a cluster. Using the simple Hamming distance, d(x7, q1) = d(x7, q2) = 0 + 0 + 0 = 0. In terms of intra-cluster similarity, however, x7 should be assigned to cluster C1.
Regarding initial cluster center selection, the classic k-modes algorithm is very sensitive to the initial centers, which are chosen either by random initialization or by manual setting; both approaches make the clustering results unstable to some extent. Initial centers at different positions, and different values of k, produce different clustering results. As shown in Figure 1, the true number of clusters of the data set is 3. Choosing different initial centers and setting different values of k may produce different clusterings; from left to right, Figure 1 shows the randomly selected initial centers, the clustering iteration process, and the final clustering result. Finding suitable initial cluster centers is therefore very important.
If the chosen dissimilarity coefficient can discover all or some of the potential modes in the data set, the partition-based k-modes algorithm benefits greatly; for k-modes to produce good clustering results, the dissimilarity between data objects within a cluster should be minimal and the dissimilarity between data objects in different clusters maximal. The present invention therefore proposes a new dissimilarity coefficient, the "intra-cluster and inter-cluster dissimilarity coefficient", based on intra-cluster and inter-cluster similarity. The k-modes algorithm based on intra-cluster and inter-cluster dissimilarity (IKMCA) determines the initial cluster centers with an improved density peaks algorithm, computes the dissimilarity between each data object and the cluster centers with the intra-cluster and inter-cluster dissimilarity coefficient, and updates the cluster centers accordingly.
The intra-cluster dissimilarity considers the relative frequency with which feature values are distributed within the same cluster. For data objects belonging to the same cluster, shared feature values occur with higher frequency and intra-cluster similarity is higher. The intra-cluster dissimilarity is defined in formula (3):

$$d(x_i, q_l) = \sum_{s=1}^{m} \left(1 - \frac{f(A_{i,s}, C_l)}{|C_l|}\right) \qquad (3)$$

where $1 \le i \le n$, $1 \le s \le m$, and $f(A_{i,s}, C_l)$ denotes the number of data objects in cluster $C_l$ whose s-th feature value equals $A_{i,s}$.
Using data set D1 and formula (3), d(x7, q1) = (1 - 2/3) + (1 - 2/3) + (1 - 1) = 2/3 and d(x7, q2) = (1 - 2/3) + (1 - 2/3) + (1 - 2/3) = 1. From these results, x7 has the minimum dissimilarity with cluster C1, so x7 should be assigned to C1. Although formula (3) accounts for the relative frequency of feature values within a cluster, it ignores the distribution of feature values across clusters. The artificial data set D2 shown in Table 3 is used to discuss the defect of ignoring inter-cluster similarity. D2 is described by three categorical features A = {A1, A2, A3}, where DOM(A1) = {A, B, C}, DOM(A2) = {E, F}, and DOM(A3) = {H, I, J}. D2 has three clusters C1, C2, and C3, with cluster centers q1 = (A, E, H), q2 = (A, E, H), and q3 = (B, E, I) respectively.
Table 3. Artificial data set D2
Suppose x10 = (A, E, H) needs to be assigned to a cluster. Using the simple Hamming distance gives d(x10, q1) = d(x10, q2) = d(x10, q3) = 0 + 0 + 0 = 0. Using formula (3) gives d(x10, q1) = (1 - 2/3) + (1 - 2/3) + (1 - 3/3) = 2/3, d(x10, q2) = (1 - 2/3) + (1 - 3/3) + (1 - 2/3) = 2/3, and d(x10, q3) = 1 + 0 + 1 = 2. Hence the simple Hamming distance cannot decide the assignment of x10, and formula (3) can assign x10 to either cluster C1 or cluster C2; that is, formula (3) cannot determine the correct cluster for x10. Viewing data set D2 from the standpoint of "low intra-cluster dissimilarity, high inter-cluster dissimilarity", assigning x10 to cluster C1 is more appropriate, because doing so maximizes the dissimilarity between clusters C1 and C2.
The inter-cluster dissimilarity considers the total frequency of a feature value across all clusters. If a feature value occurs frequently in only one cluster, that value differs strongly from the other clusters. The intra-cluster and inter-cluster dissimilarity coefficient is defined in formula (4):

$$d(x_i, q_l) = \sum_{s=1}^{m} \left(1 - \frac{f(A_{i,s}, C_l)}{|C_l|} \cdot \frac{f(A_{i,s}, C_l)}{\sum_{r=1}^{k} f(A_{i,s}, C_r)}\right) \qquad (4)$$

where $1 \le i \le n$ and $1 \le s \le m$.
Computing on data set D2 with formula (4) gives d(x10, q1) = (1 - 2/3 × 2/4) + (1 - 2/3 × 2/8) + (1 - 3/3 × 3/5) = 1.9; d(x10, q2) = (1 - 2/3 × 2/4) + (1 - 3/3 × 3/8) + (1 - 2/3 × 2/5) = 2.025; d(x10, q3) = (1 - 0 × 1) + (1 - 3/3 × 3/8) + (1 - 0 × 1) = 2.625. By formula (4), the dissimilarity between x10 and cluster C1 is the smallest; this agrees with the earlier analysis, so x10 is clustered correctly. Formula (4) is next checked on the more particular artificial data set D3. As shown in Table 4, D3 is described by three features A = {A1, A2, A3}, where DOM(A1) = {A, B}, DOM(A2) = {E, F}, and DOM(A3) = {H, I}. Its three clusters C1, C2, and C3 have cluster centers q1 = (A, E, H), q2 = (A, E, H), and q3 = (A, E, H); the values A, E, and H are uniformly distributed in D3, each occurring 6 times.
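The arithmetic above pins down formulas (3) and (4) even though the source renders them as images. A minimal sketch follows, with the frequency counts read off the worked example for D2 (counts of x10's values (A, E, H) in each cluster, cluster sizes of 3, and cross-cluster totals A:4, E:8, H:5); the function names are illustrative:

```python
def intra_dissim(counts, size):
    # Formula (3): sum over features of 1 - (within-cluster frequency / cluster size).
    return sum(1 - c / size for c in counts)

def intra_inter_dissim(counts, size, totals):
    # Formula (4): the within-cluster relative frequency is further scaled by the
    # share of the value's total occurrences that fall in this cluster.
    return sum(1 - (c / size) * (c / t if t else 0) for c, t in zip(counts, totals))

totals = [4, 8, 5]
for name, counts in [("C1", [2, 2, 3]), ("C2", [2, 3, 2]), ("C3", [0, 3, 0])]:
    print(name, round(intra_inter_dissim(counts, 3, totals), 3))
# -> C1 1.9, C2 2.025, C3 2.625, matching the worked example in the text.
```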
Table 4. Artificial data set D3
x10 = (A, E, H) is assigned using the simple Hamming distance, formula (3), and formula (4) in turn. The simple Hamming distance gives d(x10, q1) = d(x10, q2) = d(x10, q3) = 0 + 0 + 0 = 0; formula (3) gives d(x10, q1) = d(x10, q2) = d(x10, q3) = (1 - 2/3) + (1 - 2/3) + (1 - 2/3) = 1; formula (4) gives d(x10, q1) = d(x10, q2) = d(x10, q3) = (1 - 2/3 × 2/6) + (1 - 2/3 × 2/6) + (1 - 2/3 × 2/6) = 21/9. When feature values are uniformly distributed, none of these three dissimilarity coefficients can cluster x10 correctly, so the intra-cluster and inter-cluster dissimilarity coefficient is refined once more.
The feature-value distribution of the data object $x_i$ is compared with the overall feature-value distribution of its cluster; the refinement term is defined in formula (5), where $x_i$ is the data object to be assigned and $x_j$ ranges over the data objects in cluster $C_l$. The redefined intra-cluster and inter-cluster dissimilarity coefficient is given in formula (6).
For any $x_i, x_j \in D$, d has the following properties:
Self-distance: for all $x_i$, the distance of each object to itself is zero, $d(x_i, x_i) = 0$.
Symmetry: for all $x_i$ and $x_j$, the distance from $x_i$ to $x_j$ equals the distance from $x_j$ to $x_i$, $d(x_i, x_j) = d(x_j, x_i)$.
Non-negativity: for all $x_i, x_j$, the distance d is non-negative, and $d(x_i, x_j) = 0$ if and only if $x_i = x_j$.
Triangle inequality: for all $x_i$ and $x_j$, $d(x_i, x_j) \le d(x_i, x_h) + d(x_h, x_j)$.
Computing on D3 again with formula (6), the new dissimilarity coefficients are d(x10, q1) = 21/9 + ζ1 = 2.6666, d(x10, q2) = 21/9 + ζ2 = 2.8148, and d(x10, q3) = 21/9 + ζ3 = 2.8888. From these results, x10 should be assigned to cluster C1. The assignment matches the actual situation, showing that the proposed dissimilarity coefficient scheme is feasible.
In 2014, Rodriguez et al. proposed the density peaks (DP) algorithm. DP is a clustering algorithm based on relative distance and local neighborhood density; it handles numerical data, and its input is the dissimilarity matrix between data objects. By computing the dissimilarity between categorical data with a suitable dissimilarity coefficient, the DP algorithm can therefore be applied to categorical data clustering. This section exploits DP's ability to determine the number of clusters automatically in order to determine the initial cluster centers.
The local neighborhood density $\rho_i$ of data object $x_i$ equals the number of data objects within the region centered at $x_i$ with radius equal to the cut-off distance $d_c$. It can be defined in two ways: the square-wave kernel method and the Gaussian kernel method.
The square-wave kernel method suits large-scale data sets; its definition of $\rho_i$ is given in formula (7):

$$\rho_i = \sum_{j \ne i} \chi(d_{i,j} - d_c) \qquad (7)$$

where $\chi(x) = 1$ when $d_{i,j} - d_c \le 0$, and $\chi(x) = 0$ otherwise.
If the data set contains few objects, computing $\rho_i$ with the square-wave kernel is prone to statistical error and can make the clustering inaccurate; the Gaussian kernel method can then be used instead. The Gaussian kernel is a common density estimator widely applied in density-based clustering analysis. Its definition of the local neighborhood density $\rho_i$ is given in formula (8):

$$\rho_i = \sum_{j \ne i} K\!\left(\frac{d_{i,j}}{d_c}\right) \qquad (8)$$

where $K(x) = \exp\{-x^2\}$.
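A sketch of the two density estimators, assuming d is the pairwise dissimilarity matrix from step 1, $d_c > 0$, and that the sums exclude the object itself; the function names are illustrative:

```python
import math

def density_square_wave(d, i, dc):
    # Formula (7): count the neighbours whose dissimilarity to x_i is within d_c.
    return sum(1 for j, dij in enumerate(d[i]) if j != i and dij - dc <= 0)

def density_gaussian(d, i, dc):
    # Formula (8): smooth density using the Gaussian kernel K(x) = exp(-x^2).
    return sum(math.exp(-(dij / dc) ** 2) for j, dij in enumerate(d[i]) if j != i)
```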
As formulas (7) and (8) show, the value of $d_c$ directly affects the magnitude of $\rho_i$, and hence the selection of cluster centers and the whole clustering result. Determining a suitable $d_c$ is therefore important for the algorithm.
Following the DP algorithm of Rodriguez et al., the relative distance $L_i$ between data objects $x_i$ and $x_j$ is defined as in formula (9):
When $\rho_i$ is not the maximum density, $L_i$ is defined as the distance between $x_i$ and its nearest data object among all data objects with higher local neighborhood density, as in formula (10) and Figure 2:

$$L_i = \min_{j:\, \rho_j > \rho_i} d_{i,j} \qquad (10)$$
When $\rho_i$ is the maximum density, $L_i$ is defined as the distance between $x_i$ and the data object farthest from it, as in formula (11) and Figure 3. A data object with both high $L_i$ and high $\rho_i$ is a cluster center.

$$L_i = \max_{j} d_{i,j} \qquad (11)$$
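A sketch of formulas (10)-(11), assuming density ties count as "not higher" so that the unique density maximum falls through to the farthest-object case:

```python
def relative_distance(d, rho):
    # Formula (10): distance to the nearest object of strictly higher density;
    # formula (11): for the density maximum, distance to the farthest object.
    n = len(rho)
    L = [0.0] * n
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        L[i] = min(higher) if higher else max(d[i][j] for j in range(n) if j != i)
    return L
```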
The cut-off distance $d_c$ is a critical value that limits the distance search range. In the DP algorithm, $d_c$ must be set manually: the pairwise distances in the data set are sorted in ascending order, and the value at the 1% to 2% position is taken as $d_c$, which gives only a rough range. In practical clustering problems, setting $d_c$ too large makes the computed $\rho_i$ values overlap, while setting it too small makes the cluster distribution sparse. The present invention gives a concrete method for determining $d_c$. Let $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,n}]$ be the dissimilarities between $x_i$ and the other data objects, computed with formula (1), and let $d'_i = [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,n}]$ be $d_i$ sorted in ascending order. The cut-off distance $d_{c,i}$ of $x_i$ is defined in formula (12):

$$d_{c,i} = \max_{j}\,(d'_{i,j+1} - d'_{i,j}) \qquad (12)$$

where $\max(d'_{i,j+1} - d'_{i,j})$ is the maximum difference between adjacent dissimilarities in $d'_i$.
Let $d'_{i,j} = d_a$ and $d'_{i,j+1} = d_b$. As shown in Figure 4, the data object $x_i$ has small dissimilarity to data objects in its own cluster and large dissimilarity to data objects in other clusters. Therefore, within $d'_i = [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,j}, d'_{i,j+1}, \ldots, d'_{i,n}]$ there must be a critical position where the difference between $d'_{i,j+1}$ and $d'_{i,j}$ is largest; data object $x_i$ and data object a are then considered to belong to the same cluster, while data object b belongs to a different cluster. The $d_{c,i}$ value of data object $x_i$ is computed as in formula (13):
The value of $d_c$ is defined as the minimum of the set $\{d_{c,i}\}$, as in formula (14):

$$d_c = \min_i\, d_{c,i} \qquad (14)$$
IKMCA determines the initial cluster centers under two assumptions: (1) the local neighborhood density of a cluster center is higher than that of the surrounding non-center points; (2) the relative distances between cluster centers are large. Based on these assumptions, this section gives a method for determining the initial cluster centers automatically. Figure 5 shows a two-dimensional example data set with 93 data objects and 2 clusters, corresponding to 2 cluster centers.
In the DP algorithm, cluster centers are selected from the decision graph. As shown in Figure 6, the horizontal axis of the decision graph is the local neighborhood density $\rho_i$ of data object $x_i$ and the vertical axis is the relative distance $L_i$. Points whose $\rho_i$ and $L_i$ are both large are the cluster centers of the data set; the two points in the upper-right corner of Figure 6 are the centers of its two clusters. A cluster center is surrounded by many data objects, so both its local neighborhood density $\rho_i$ and its relative distance $L_i$ are large.
To observe and determine the cluster centers more intuitively, the $Z_i$ decision graph is used to select them. The local neighborhood density and relative distance of each data object are obtained from formulas (8) and (9). The $Z_i$ values of all data objects are computed by $Z_i = \rho_i \times L_i$ and sorted in descending order into the sequence $Z_{(1)} > Z_{(2)} > \ldots > Z_{(n)}$, where the points corresponding to $Z_{(1)} > Z_{(2)} > \ldots > Z_{(k)}$ (k < n) are the cluster centers. As shown in Figure 7, the horizontal axis of the $Z_i$ decision graph is the index of data object $x_i$ and the vertical axis is $Z_i$; the larger $Z_i$, the more likely the object is a cluster center. Passing from cluster centers to non-centers, the $Z_i$ graph exhibits a very clear knee: the abscissa of the knee is the number of clusters k, the data objects left of the knee are the cluster centers, and those right of it are non-centers. In Figure 7, the two points at the upper left are cluster centers, and the very smoothly distributed points at the lower right are all non-centers.
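A programmatic stand-in for reading the knee off the $Z_i$ graph follows; picking the largest drop between consecutive sorted values is an assumption, since the patent locates the knee visually:

```python
def choose_k(rho, L):
    # Z_i = rho_i * L_i, sorted in descending order; the sharpest drop between
    # consecutive values marks the knee, and the points before it are centers.
    Z = sorted((r * l for r, l in zip(rho, L)), reverse=True)
    gaps = [Z[j] - Z[j + 1] for j in range(len(Z) - 1)]
    return gaps.index(max(gaps)) + 1
```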
The new intra-cluster and inter-cluster dissimilarity coefficient makes the dissimilarity computation for categorical data more accurate. Automatic selection of the initial cluster centers avoids the uncertainty that random selection or manual setting brings to the classic k-modes algorithm. The DP algorithm gives only a rough range for $d_c$; building on DP, the present invention gives an explicit method for determining $d_c$. Applying the intra-cluster and inter-cluster dissimilarity coefficient to the classic k-modes algorithm, the objective function is defined in formula (15). Theorem 1 shows how to minimize the objective function F(U, Q).
$$u_{il} \in \{0, 1\},\quad 1 \le i \le n,\ 1 \le l \le k \qquad (16)$$

$$\sum_{l=1}^{k} u_{il} = 1,\quad 1 \le i \le n \qquad (17)$$

$$0 < \sum_{i=1}^{n} u_{il} < n,\quad 1 \le l \le k \qquad (18)$$
$U_{n \times k}$ is the membership matrix satisfying constraints (16)-(18); $u_{il} = 1$ indicates that $x_i$ belongs to cluster $C_l$. When constraints (16)-(18) hold and the objective function F(U, Q) reaches its minimum, the clustering algorithm can be judged to have finished.
Theorem 1: The cluster centers of IKMCA should be selected so that the function F(U, Q) is minimized, if and only if $t_s \ne t_h$ ($1 \le s \le m$). In words, each feature value of a cluster center should be the value occurring with the highest frequency on that feature.
$f_{s,t}(x_i)$ ($1 \le s \le m$, $1 \le t \le n_s$) denotes the number of data objects whose s-th feature takes the value $A_{s,t}$, as in formula (19):

$$f_{s,t}(x_i) = |\{x_i \in D,\ x_{i,s} = A_{s,t}\}| \qquad (19)$$
The function F(U, Q) is minimized when and only when the values of $q_l$ are taken from $\mathrm{DOM}(A_s)$, $1 \le s \le m$, and formula (20), the most-frequent-value condition, is satisfied:
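A sketch of the center update that Theorem 1 prescribes, taking the per-feature mode within each cluster; the function name and tuple representation are illustrative:

```python
from collections import Counter

def update_center(cluster):
    # Theorem 1 / formula (20): the new center takes, on each feature,
    # the value occurring most frequently within the cluster.
    m = len(cluster[0])
    return tuple(
        Counter(obj[s] for obj in cluster).most_common(1)[0][0] for s in range(m)
    )

# Example: the mode of each column.
print(update_center([("A", "E", "H"), ("A", "F", "H"), ("B", "E", "H")]))
# -> ('A', 'E', 'H')
```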
To drive the objective function F(U, Q) to a minimum, the improved k-modes algorithm based on intra-cluster and inter-cluster dissimilarity (IKMCA), i.e. the categorical data clustering method of the present invention, is described as follows (see Figure 8):
Input: a categorical data set D with n data objects and m categorical features. Output: the final cluster set C = {C1, C2, ..., Ck}.
Step 1. Compute the dissimilarities $d_{i,j}$ with formula (1) and obtain the dissimilarity matrix $d_{n \times n}$.
Step 2. Compute the cut-off distance $d_c$ according to formula (14).
Step 3. Compute the local neighborhood density $\rho_i$ using formula (7) or formula (8).
Step 4. Compute the relative distance $L_i$ using formula (9).
Step 5. Compute $Z_i = \rho_i \times L_i$ to obtain $Z = \{Z_1, Z_2, \ldots, Z_n\}$.
Step 6. Sort the $Z_i$ in descending order into the sequence $Z_{(1)} > Z_{(2)} > \ldots > Z_{(n)}$. Plot the $Z_i$ decision graph with the index of data object $x_i$ on the horizontal axis and $Z_i$ on the vertical axis, and locate the knee; the abscissa of the knee is the best value of k.
Step 7. Determine k and the initial cluster center set $q^{(0)} = \{q_1, q_2, \ldots, q_k\}$.
Step 8. Compute the dissimilarity $d(x_i, q_l)$ between the n-k remaining data objects and the k initial cluster centers according to formula (6).
Step 9. Assign each data object to its nearest initial cluster. After the assignment, k clusters $C^{(1)} = \{C_1, C_2, \ldots, C_k\}$ are obtained, and the n-k data objects are given cluster labels.
Step 10. On the newly formed clusters, update the cluster centers $q^{(1)} = \{q_1, q_2, \ldots, q_k\}$ according to Theorem 1.
Step 11. Repeat steps 8-10 until the objective function value no longer changes; when it no longer changes, the algorithm ends, otherwise continue from step 8.
Step 12. The algorithm ends and the clustering is complete.
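A compact sketch of steps 8-12 follows, assuming a pluggable dissimilarity function; the simple Hamming distance is used below only as an illustrative stand-in for formula (6), whose ζ term (formula (5)) is not fully reproduced in the source:

```python
from collections import Counter

def ikmca_iterate(D, centers, dissim, max_iter=100):
    # Steps 8-12: nearest-center assignment under the given dissimilarity,
    # followed by a per-feature mode update, repeated until centers stabilise.
    k, m = len(centers), len(D[0])
    labels = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        labels = []
        for x in D:
            l = min(range(k), key=lambda c: dissim(x, centers[c]))
            clusters[l].append(x)
            labels.append(l)
        new_centers = [
            tuple(Counter(obj[s] for obj in cl).most_common(1)[0][0] for s in range(m))
            if cl else centers[l]  # keep the old center if a cluster empties
            for l, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Illustrative run with Hamming distance standing in for formula (6).
hamming = lambda x, q: sum(a != b for a, b in zip(x, q))
data = [("A", "E", "H"), ("A", "E", "I"), ("B", "F", "I"), ("B", "F", "J")]
print(ikmca_iterate(data, [("A", "E", "H"), ("B", "F", "J")], hamming))
```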
Let l be the number of iterations required for the algorithm to converge; usually n >> m, k, l. The time cost of IKMCA lies mainly in updating the cluster centers and the dissimilarities in each iteration. Initializing the cluster centers requires a human to read the decision graph, so the time complexity of that stage is not counted in the overall algorithm. Updating the cluster centers and computing the dissimilarities with the intra-cluster and inter-cluster dissimilarity coefficient in each iteration costs l(O(nmk) + O(nmk)) = O(nmkl), so the total time complexity of IKMCA is O(nmkl). From this analysis, the time complexity of the IKMCA algorithm with intra-cluster and inter-cluster dissimilarity is linearly scalable in the number of data objects, the number of clusters, and the number of features.
To evaluate the effectiveness of the proposed algorithm, the clustering results are assessed with three indicators: clustering accuracy (AC), purity (PR), and recall (RE), given in formulas (21)-(23). NUM+ denotes the number of data objects correctly assigned to cluster $C_l$; NUM- denotes the number of data objects incorrectly assigned to cluster $C_l$; NUM* denotes the number of data objects that should have been assigned to cluster $C_l$ but were not. The closer the clustering result is to the true partition of the data set, the larger the values of AC, PR, and RE, and the more effective the algorithm.
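Formulas (21)-(23) are rendered as images in the source; the sketch below assumes the standard forms implied by the NUM+/NUM-/NUM* definitions, namely AC = Σ NUM+ / n with PR and RE averaged per cluster. The cluster-to-class mapping argument is also an assumption:

```python
def evaluate(labels_true, labels_pred, k, mapping):
    # AC, PR, RE built from the NUM+/NUM-/NUM* counts described in the text.
    # mapping[l] is the true class matched to predicted cluster l.
    n = len(labels_true)
    ac = pr = re = 0.0
    for l in range(k):
        plus = sum(1 for t, p in zip(labels_true, labels_pred) if p == l and t == mapping[l])
        minus = sum(1 for t, p in zip(labels_true, labels_pred) if p == l and t != mapping[l])
        star = sum(1 for t, p in zip(labels_true, labels_pred) if p != l and t == mapping[l])
        ac += plus
        pr += plus / (plus + minus) if plus + minus else 0.0
        re += plus / (plus + star) if plus + star else 0.0
    return ac / n, pr / k, re / k
```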
The algorithm is implemented in Python, and all experiments were run on an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz under Windows 10. The data sets used are real data sets. UCI is a repository of real data sets for machine learning provided by the University of California, Irvine. To test the effectiveness of the algorithm, the Mushroom (Mus), Breast-cancer (Bre), Car, and Soybean-small (Soy) data sets were selected from the UCI repository for experimental verification. Table 5 lists the details of these data sets.
Table 5. Data set descriptions
The IKMCA algorithm proposed here, the k-modes algorithm of Huang, the IDMKCA algorithm of Ng et al., and the EKACMD algorithm of Ravi et al. were each run 30 times and the results averaged. The AC, PR, and RE results are shown in Tables 6-9.
Table 6. Experimental results of the four algorithms on the Mus data set
Table 7. Experimental results of the four algorithms on the Bre data set
Table 8. Experimental results of the four algorithms on the Car data set
Table 9. Experimental results of the four algorithms on the Soy data set
The experimental results show that, on the Mus, Bre, Car, and Soy data sets, IKMCA outperforms the k-modes, IDMKCA, and EKACMD algorithms on AC, PR, and RE in most cases. IKMCA outperforms the classic k-modes algorithm because the preprocessing of k-modes destroys the original structure of the categorical features: computing the dissimilarity coefficient of the converted categorical values with the simple Hamming distance does not reveal the dissimilarity between categorical data. When a data set has very many features, a simple 0-1 comparison may yield a dissimilarity that is far too large or far too small. Compared with the classic k-modes algorithm and with IDMKCA and EKACMD, the proposed intra-cluster and inter-cluster dissimilarity reveals the structure of the data set better.
The classic k-modes algorithm computes dissimilarity with the simple Hamming distance, which weakens intra-class similarity and ignores inter-cluster similarity. To address these problems, the present invention proposes a new dissimilarity calculation method based on intra-cluster and inter-cluster similarity. The method prevents the loss of important feature values during clustering, strengthens the similarity between feature values within a cluster, and weakens the similarity between feature values across clusters. The proposed automatic cluster center selection method greatly reduces the error that random or manual center selection introduces into clustering. Illustrative examples have shown the limitations of the simple Hamming distance and several other dissimilarity coefficients used with k-modes, and a new dissimilarity coefficient has been proposed. The categorical data clustering algorithm with the improved dissimilarity coefficient was compared experimentally on UCI data sets against k-modes algorithms based on other dissimilarity coefficients. The results show that the proposed dissimilarity coefficient calculation preserves the characteristics of the data, meets the standard of low intra-cluster dissimilarity and high inter-cluster dissimilarity, improves clustering accuracy, purity, and recall, and effectively improves the clustering of categorical data.
The proposed algorithm (IKMCA) can be applied to rehabilitation treatment recommendation systems containing massive purely categorical data. In such a system, categorical data clustering can help analysts distinguish different patient groups in the patient database, discover deeper information hidden in its distribution, and formulate personalized rehabilitation plans.
It should be noted that although the embodiments described above are illustrative, they do not limit the invention, which is therefore not restricted to the specific implementations above. Any other implementation obtained by those skilled in the art under the teaching of the present invention without departing from its principles is deemed to fall within its protection.
Claims (2)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011009696.6A | 2020-09-23 | 2020-09-23 | A Categorical Data Clustering Method Based on Intra-cluster Dissimilarity |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112132217A | 2020-12-25 |
| CN112132217B | 2023-08-15 |
Family
ID=73841250
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011009696.6A (CN112132217B, active) | A Categorical Data Clustering Method Based on Intra-cluster Dissimilarity | 2020-09-23 | 2020-09-23 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN112132217B (en) |
Families Citing this family (2)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN114970649B * | 2021-02-23 | 2024-07-26 | 广东精点数据科技股份有限公司 | Network information processing method based on clustering algorithm |
| CN117853152B * | 2024-03-07 | 2024-05-17 | 云南疆恒科技有限公司 | Business marketing data processing system based on multiple channels |
Citations (4)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| US6049797A * | 1998-04-07 | 2000-04-11 | Lucent Technologies, Inc. | Method, apparatus and programmed medium for clustering databases with categorical attributes |
| CN107122793A * | 2017-03-23 | 2017-09-01 | 北京航空航天大学 | An improved global-optimization k-modes clustering method |
| CN107358368A * | 2017-07-21 | 2017-11-17 | 国网四川省电力公司眉山供电公司 | A robust k-means clustering method for power consumer segmentation |
| CN108510010A * | 2018-04-17 | 2018-09-07 | 中国矿业大学 | A density peaks clustering method and system based on pre-screening |
Family Cites Families (1)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| US20170371886A1 * | 2016-06-22 | 2017-12-28 | Agency For Science, Technology And Research | Methods for identifying clusters in a dataset, methods of analyzing cytometry data with the aid of a computer and methods of detecting cell sub-populations in a plurality of cells |
Non-Patent Citations (1)
| Title |
|---|
| Jia Ziqi; Song Ling. A k-prototypes clustering algorithm for mixed-type data clustering. Journal of Chinese Computer Systems, No. 09, pp. 1845-1852. * |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |