CN111626321A - Image data clustering method and device - Google Patents
- Publication number
- CN111626321A (application CN202010260470.7A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- points
- data
- point
- micro
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Description
Technical Field

The invention relates to an image data clustering method and device, belonging to the technical field of image processing.

Background Art

Cluster analysis is an unsupervised classification method. Its goal is to divide an unlabeled data set into several clusters while ensuring that objects within a cluster are similar to each other and objects in different clusters are dissimilar. It can be applied to optimization analysis, image segmentation, bioinformatics, and many other fields.

Clustering methods can be used to classify image data. The density peaks clustering algorithm (Clustering by fast search and find of density peaks, DPC), proposed by Rodriguez and Laio, is effective, novel, and simple, and its applications have grown steadily in recent years. DPC rests on two assumptions: (1) a cluster center is surrounded by neighbors of lower density; (2) cluster centers are relatively far from one another. Two concepts follow: (1) the local density of a data point x_i, denoted ρ_i; (2) the distance from x_i to the nearest point of higher density, denoted δ_i. On the basis of these two assumptions, the authors of DPC proposed the overall procedure of the algorithm: first select the data points with large values of both ρ_i and δ_i as cluster centers, then assign each non-center point to the cluster of its nearest higher-density data point. Although DPC performs well in many respects, it still has certain defects: the choice of the cutoff distance d_c affects clustering accuracy, i.e., a slight change in d_c affects the density of data points and the selection of cluster centers, and thereby the final clustering result; cluster centers must be selected manually, which introduces subjective factors and affects the objectivity of the final result; and for data sets with large density differences between clusters, DPC tends to miss the centers of low-density clusters.

On this basis, many scholars have made corresponding improvements to the density peaks clustering algorithm. For example, Gao Shiying et al. proposed R-DPC, a density peaks clustering algorithm based on density ratio, which introduces the density ratio into DPC and improves the identification of low-density clusters by computing the density ratio of the sample data, thereby improving overall clustering accuracy. Xue Xiaona et al. proposed an improved density peaks clustering algorithm combined with k-nearest neighbors, giving a new local density measure to describe the spatial distribution of each sample, which effectively improves clustering quality. Xie et al. proposed a density peak search and point assignment algorithm based on a fuzzy weighted k-nearest-neighbor technique to resolve the inconsistency of density measurement in the DPC algorithm; it uses k-nearest-neighbor information to define the local density of points and to search for and discover cluster centers. These improvements raise the performance of the density peaks clustering algorithm to a certain extent, but they struggle with variable-density data sets and have high time complexity.

Summary of the Invention
The purpose of the present invention is to provide an image data clustering method and device, so as to solve the problem that current image data clustering results are poor.

To solve the above technical problem, the present invention provides an image data clustering method comprising the following steps:

1) Obtain the image data to be clustered, perform dimensionality reduction on the image data, treat each image as a data point, and determine the density of each data point;

2) For each data point, judge whether its density is greater than the density of every data point in its neighborhood; if so, recommend it as a local center point; then, taking the local center points as centers, assign the remaining data points to the local center points to generate micro-clusters;

3) Determine the number of boundary points according to a first set ratio, select the corresponding number of points with the largest boundary degree as boundary points, and, disregarding the boundary points, determine whether two micro-clusters form a neighbor-cluster relationship;

4) Determine boundary points according to a second set ratio and judge whether these boundary points contain local center points; if they do, then for each micro-cluster whose local center point is among them, select from its neighbor clusters the micro-cluster with the highest combination degree, merge the two, and delete the boundary points;

Continue to determine boundary points from the remaining points according to the second set ratio to obtain the next layer of boundary points, and judge whether this layer contains local center points; if it does, then for each micro-cluster whose local center point is among them, select from its neighbor clusters the micro-cluster with the highest combination degree, merge the two, and delete this layer of boundary points. Repeat until boundary points have been deleted the set number of times. After that, for each micro-cluster whose local center point still exists, if its neighbor clusters still contain data points, merge the micro-cluster with those neighbor clusters.

The present invention also provides an image data clustering device comprising a memory, a processor coupled to the memory, and a computer program stored in the memory and run on the processor; when the processor executes the computer program, it implements the image data clustering method of the present invention.

The image clustering method of the present invention adopts a self-recommendation strategy to select local center points, which eliminates the subjectivity of manually selecting cluster centers and solves the problem that the center points of low-density clusters are ignored. Next, micro-clusters are generated with the local center points as micro-cluster centers. A micro-cluster merging method is then proposed: in the process of deleting boundary points layer by layer, each micro-cluster is merged with the neighbor cluster having the highest combination degree with it, because such a micro-cluster lies closer to the cluster core and is more likely to belong to the same cluster, making the final clustering result more accurate.

Further, to solve the problem that the cutoff distance is difficult to choose when computing density, the density of each point in step 1) is determined from the point's mutual neighbor degree and the mean distance between the point and the points in its neighborhood.

Further, the present invention gives a formula for computing the density of a data point:
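The formula image itself is missing from this text. A plausible reconstruction, consistent with the symbol definitions that follow and with the later statement that density is proportional to the mutual neighbor degree and inversely proportional to the mean weighted distance to neighborhood points (treating the rank indices ε_pq and ε_qp as per-term weights is an assumption), is:

$$\rho(x_p)=\frac{MND(x_p)}{\dfrac{1}{|NG(x_p)|}\sum_{x_q\in NG(x_p)}\left(\varepsilon_{pq}+\varepsilon_{qp}\right)d_w(x_p,x_q)}$$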
where ρ(x_p) denotes the density of data point x_p, MND(x_p) denotes the mutual neighbor degree of x_p, NG(x_p) denotes the neighborhood of x_p, |NG(x_p)| denotes the number of data points in the neighborhood of x_p, and d_w(x_p, x_q) denotes the weighted Euclidean distance from x_p to x_q. Among the k neighbors of x_p, if x_q is the ε-th neighbor of x_p, then the value of ε_pq is ε; the larger ε is, the farther x_q is from x_p. Likewise, if x_p is the ε-th neighbor of x_q, the value of ε_qp is ε.

Further, to reflect the intrinsic relationship between data points, the distance between two data points is the weighted Euclidean distance, with the Pearson correlation coefficient of the two points as the weight.

Further, to accurately determine whether a data point is a boundary point, the concept of the boundary degree of a data point is proposed; the larger the boundary degree, the more likely the point is a boundary point. The boundary degree of a data point is computed as:
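The formula image is missing. A plausible reconstruction, matching the later description that the boundary degree grows with the distances from x_p to its k nearest neighbors relative to the distances from those neighbors to their own neighbors (the precise form is an assumption), is:

$$bp(x_p)=\frac{\bar{d}(x_p)}{\dfrac{1}{|kNN(x_p)|}\sum_{x_q\in kNN(x_p)}\bar{d}(x_q)},\qquad \bar{d}(x)=\frac{1}{|kNN(x)|}\sum_{x_q\in kNN(x)}d_w(x,x_q)$$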
where bp(x_p) denotes the boundary degree of data point x_p, |kNN(x_p)| denotes the number of neighbors of x_p, kNN(x_p) denotes the k nearest neighbor points of x_p, r(x_p, x_q) denotes the Pearson correlation coefficient between data points x_p and x_q, and m is the dimension of the data set.

Further, to accurately determine the relationship between micro-clusters, neighboring micro-clusters are judged by the following criterion:
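The formula image is missing here, but the same criterion is stated in prose in the detailed description below; in symbols:

$$R(c_i,c_j)\iff\Bigl(\bigcup_{x_p\in c_i}NG(x_p)\Bigr)\cap c_j\neq\varnothing$$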
where NG(x_p) denotes the neighborhood of data point x_p, c_i denotes the i-th micro-cluster formed, and R(c_i, c_j) denotes that micro-clusters c_i and c_j form a neighbor-cluster relationship.

Further, after the micro-clusters are formed, the method changes the local center point of each micro-cluster to the mean point of all data points in the micro-cluster; if no such mean point exists, the center is changed to the data point in the micro-cluster closest to the mean point.

Further, the present invention gives a formula for the combination degree:
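The formula image is missing. A plausible reconstruction, consistent with the symbol explanation that follows (a larger retained fraction |c_jt|/|c_j| and a smaller distance between local center points both raise the combination degree; the exact form is an assumption), is:

$$combine(c_i,c_j)=\frac{|c_{jt}|/|c_j|}{d_w(lc_i,lc_j)}$$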
where |c_j| denotes the number of data points in micro-cluster c_j in the initial state, and |c_jt| denotes the number of data points in c_j after t layers of boundary points have been deleted. The larger the ratio |c_jt|/|c_j|, the closer c_j is to the cluster core region and the higher the combination degree of c_j and c_i; d_w(lc_i, lc_j) denotes the weighted Euclidean distance between the two local center points.

Further, to prevent some anomalous points from forming clusters on their own, the method also includes a process of adjusting the clusters obtained after micro-cluster merging:

After merging homologous clusters, judge whether the number of data points in a resulting cluster is less than or equal to 0%-5% of the total number of data points; if so, treat the cluster as an anomalous cluster and assign each of its data points to the non-anomalous cluster containing the nearest data point.

Brief Description of the Drawings
Fig. 1 is a flow chart of the image data clustering method of the present invention;
Fig. 2-a is a schematic diagram of the preliminary clustering result on the Pathbased data set (k=7) in an embodiment of the present invention;
Fig. 2-b is a schematic diagram of the preliminary clustering result on the Pathbased data set (k=11) in an embodiment of the present invention;
Fig. 2-c is a schematic diagram of the preliminary clustering result on the Flame data set (k=7) in an embodiment of the present invention;
Fig. 2-d is a schematic diagram of the preliminary clustering result on the Flame data set (k=18) in an embodiment of the present invention;
Fig. 2-e is a schematic diagram of the preliminary clustering result on the Spiral data set (k=6) in an embodiment of the present invention;
Fig. 2-f is a schematic diagram of the preliminary clustering result on the Spiral data set (k=13) in an embodiment of the present invention;
Fig. 3-a is a schematic diagram of the final clustering result on the Pathbased data set in an embodiment of the present invention;
Fig. 3-b is a schematic diagram of the final clustering result on the Flame data set in an embodiment of the present invention;
Fig. 3-c is a schematic diagram of the final clustering result on the Spiral data set in an embodiment of the present invention;
Fig. 4-a is a schematic diagram of the final clustering result on the Olivetti Faces data set in an embodiment of the present invention;
Fig. 4-b is a schematic diagram of the final clustering result on the Olivetti Faces data set using the DPC algorithm;
Fig. 5 is a structural block diagram of the image data clustering device of the present invention.
Detailed Description

Specific embodiments of the present invention are further described below with reference to the accompanying drawings.

Method Embodiment

Considering that the DPC algorithm and other clustering algorithms cluster image data poorly, the present invention provides a new image clustering method. Building on existing DPC-based image clustering, it proposes two improvements: first, the density of data points is computed using the weighted Euclidean distance and the mutual neighbor degree; second, a self-recommendation strategy automatically determines local center points, micro-clusters are formed around the determined local center points, and a strategy is proposed for merging each micro-cluster with the neighbor cluster containing fewer boundary points, yielding the final clustering result. The implementation flow is shown in Figure 1. The implementation process of the method is described in detail below with specific examples.

1. Obtain the image data to be clustered and preprocess it.

Acquired image data generally has many features; a 90×150 image, for example, has 13,500 features. The image data therefore needs dimensionality reduction, and this embodiment uses the PCA algorithm. For the Olivetti faces data set, PCA keeps the leading features up to a cumulative contribution rate of 90% and filters out the rest. To eliminate the influence of differing value ranges across features on the experimental results, the data must be normalized. The formula used for normalization is:
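The formula image is missing; given the stated purpose of removing differences in value ranges across features, the standard min-max normalization is the most plausible reading:

$$x'_{ij}=\frac{x_{ij}-\min_{1\le l\le n}x_{lj}}{\max_{1\le l\le n}x_{lj}-\min_{1\le l\le n}x_{lj}}$$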
where x_ij is the value of the i-th data point on the j-th attribute, and x'_ij is that value after normalization.

2. Determine the density of each data point.

For a given data set X_{n×m} = {x_1, x_2, ..., x_i, ..., x_n}, where n is the number of data points, m is the dimension of a data point, and x_i = [x_i1, x_i2, ..., x_im], the data point density in the existing DPC algorithm is computed as:
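The formula images are not reproduced in the text; the two standard DPC density definitions being referred to are the cutoff kernel (1) and the Gaussian kernel (2):

$$\rho_i=\sum_{j\ne i}\chi(d_{ij}-d_c),\qquad \chi(x)=\begin{cases}1, & x<0\\ 0, & x\ge 0\end{cases}\qquad (1)$$

$$\rho_i=\sum_{j\ne i}\exp\!\left(-\frac{d_{ij}^{2}}{d_c^{2}}\right)\qquad (2)$$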
Here, formula (1) targets data sets with many samples, and formula (2) targets data sets with few samples. The value d_c has the same meaning in both formulas, namely the cutoff distance, and d_ij denotes the Euclidean distance between two data points.

The present invention first uses the Pearson correlation coefficient between two data points as a weight to compute the weighted Euclidean distance between them. It then proposes the concept of mutual neighbor degree to measure how closely a data point is connected to the points in its neighborhood: the higher the mutual neighbor degree, the more closely the point is connected to its neighborhood, and the greater its density. Finally, the weighted Euclidean distance and the mutual neighbor degree are combined to compute the density of each data point.

1) Compute the weighted Euclidean distance between two data points.

This embodiment uses the Pearson correlation coefficient as an influence factor to define a new distance between data points, as follows:
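The formula image is missing. A plausible reconstruction (with the caveat that exactly how the Pearson coefficient enters the distance is an assumption) shrinks the Euclidean distance between strongly correlated points:

$$d_w(x_p,x_q)=\bigl(1-r(x_p,x_q)\bigr)\sqrt{\sum_{j=1}^{m}\left(x_{pj}-x_{qj}\right)^{2}},\qquad r(x_p,x_q)=\frac{\sum_{j=1}^{m}(x_{pj}-\bar{x}_p)(x_{qj}-\bar{x}_q)}{\sqrt{\sum_{j=1}^{m}(x_{pj}-\bar{x}_p)^{2}}\sqrt{\sum_{j=1}^{m}(x_{qj}-\bar{x}_q)^{2}}}$$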
where d_w(x_p, x_q) denotes the weighted Euclidean distance between data points x_p and x_q, r(x_p, x_q) denotes the Pearson correlation coefficient between them, x_pj denotes the value of x_p on its j-th attribute, x_qj denotes the value of x_q on its j-th attribute, and m is the dimension of the data set, i.e., the number of features per data point.

2) Compute the mutual neighbor degree and the density of each data point.

Before defining the mutual neighbor degree, we first describe the mutual neighbor relationship. If x_q is among the k nearest neighbors of x_p and x_p is among the k nearest neighbors of x_q, then x_p and x_q form a mutual neighbor relationship. On this basis, the concept of the neighborhood of a data point is proposed. Taking x_p as an example, the neighborhood of x_p is the set of data points among the k nearest neighbors of x_p that form mutual neighbor relationships with x_p, i.e., NG(x_p) = {x_q | x_q ∈ kNN(x_p) ∧ x_p ∈ kNN(x_q)}, where kNN(x_p) denotes the set of the k data points closest to x_p and NG(x_p) denotes the set of data points in the neighborhood of x_p. On this basis, the present invention proposes the mutual neighbor degree, which reflects the density of a data point and is computed as:
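The formula image is missing and the exact form cannot be recovered from the text; one natural reconstruction, consistent with "the higher the mutual neighbor degree, the more closely a point is connected to its neighborhood", is the fraction of the k nearest neighbors that are mutual neighbors (an assumption):

$$MND(x_p)=\frac{|NG(x_p)|}{k}$$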
The larger the value of MND(x_p), the greater the density of x_p.

The density of a data point is inversely proportional to the mean distance to the points in its neighborhood and proportional to its own mutual neighbor degree. Combining the weighted Euclidean distance and the mutual neighbor degree, a new density computation is therefore proposed, with the following formula:
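The formula image is again missing; the same plausible reconstruction as in the Summary above (the ε weighting being an assumption) is:

$$\rho(x_p)=\frac{MND(x_p)}{\dfrac{1}{|NG(x_p)|}\sum_{x_q\in NG(x_p)}\left(\varepsilon_{pq}+\varepsilon_{qp}\right)d_w(x_p,x_q)}$$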
where |NG(x_p)| denotes the number of data points in the neighborhood of x_p.

3. Automatically select local center points with the self-recommendation strategy to form micro-clusters.

For data sets with large density differences between clusters, DPC easily misses the center points of low-density clusters. The present invention therefore proposes a strategy for automatically selecting local center points: if the density of x_p is greater than the density of every point in its neighborhood, x_p is recommended as a local center point (the self-recommendation strategy for short); otherwise, x_p is not recommended. If there are c local center points in total and LC denotes the set of all local center points in a data set, then LC = {lc_1, lc_2, ..., lc_i, ..., lc_j, ..., lc_c}.

With the self-recommendation strategy, points in relatively low-density clusters have a good chance of being selected as local center points, so relatively low-density clusters are not ignored.

This embodiment assigns the remaining data points with the DPC method, i.e., each remaining data point is assigned to the cluster of the nearest data point with density higher than its own. After the assignment process, multiple "micro-clusters" are formed, and the preliminary clustering task is complete. Let C = {c_1, c_2, ..., c_i, ..., c_j, ..., c_c} denote the preliminary clustering result (since there are c local centers, there are c micro-clusters in total), where c_i denotes the i-th micro-cluster. Because image data remains fairly high-dimensional even after dimensionality reduction, this embodiment shows preliminary clustering results only on three widely used synthetic data sets: the Pathbased, Flame, and Spiral data sets. The preliminary clustering results are shown in Figures 2-a, 2-b, 2-c, 2-d, 2-e, and 2-f. (Note: different shapes in the figures represent different micro-clusters, but two non-adjacent micro-clusters of the same shape also represent different micro-clusters; black five-pointed stars mark the local center points of different micro-clusters; black numbers are the cluster numbers of different micro-clusters.) As an alternative embodiment, clustering methods other than DPC may also be used to assign the remaining data points to the cluster of the nearest local center point to form micro-clusters.
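As a concrete illustration of the self-recommendation strategy and the density-ordered assignment just described, the following minimal Python sketch may be helpful. It is not taken from the patent; the density values, mutual neighborhoods NG(x_p), and weighted distance matrix are assumed to be precomputed as described above.

```python
import numpy as np

def local_centers_and_microclusters(density, neighborhoods, dist):
    """density: (n,) array of point densities rho(x_p);
    neighborhoods: list of integer index arrays, NG(x_p) for each point;
    dist: (n, n) matrix of weighted Euclidean distances d_w.
    Returns micro-cluster labels and the local center indices."""
    n = len(density)
    # Self-recommendation: a point denser than every point in its
    # neighborhood recommends itself as a local center.
    centers = [p for p in range(n)
               if len(neighborhoods[p]) > 0
               and density[p] > density[neighborhoods[p]].max()]
    labels = np.full(n, -1)
    for i, c in enumerate(centers):
        labels[c] = i
    # Assign remaining points in order of decreasing density: each point
    # joins the micro-cluster of its nearest already-labeled, denser point.
    # (Ties in density are broken by this ordering.)
    order = np.argsort(-density)
    for idx, p in enumerate(order):
        if labels[p] != -1:
            continue
        denser = order[:idx]
        if len(denser) == 0:
            # Degenerate case: densest point with an empty neighborhood
            # becomes its own micro-cluster center.
            centers.append(p)
            labels[p] = len(centers) - 1
            continue
        nearest = denser[np.argmin(dist[p, denser])]
        labels[p] = labels[nearest]
    return labels, np.array(centers)
```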
As can be seen from the figures, for each data set, as long as k is neither too large nor too small relative to the total number of data points in the data set, it is easy to find a suitable k that yields a correct preliminary clustering result, i.e., one in which, for each micro-cluster, all of its data points truly belong to the same cluster.

Note that, to ensure that all local center points can form a stable cluster structure, after the remaining data points are assigned, each local center point is changed to the mean point of all data points in its micro-cluster; if no such mean point exists, it is changed to the data point in the micro-cluster closest to the mean point. This is done so that the center point moves toward the middle of the micro-cluster and avoids being deleted as a boundary point.

4. Merge micro-clusters.

In the present invention, a data set consists of multiple clusters, and a cluster consists of multiple micro-clusters, so micro-clusters need to be merged into clusters. Before merging, the neighbor-cluster relationship is introduced.

If the union of the neighborhoods of all data points in micro-cluster c_i intersects the members of micro-cluster c_j, then c_j and c_i are considered to form a neighbor-cluster relationship, denoted R(c_i, c_j). That is, R(c_i, c_j) holds if and only if (⋃_{x_p ∈ c_i} NG(x_p)) ∩ c_j ≠ ∅.

If two micro-clusters satisfy the neighbor-cluster relationship, they are called a pair of clusters. If two micro-clusters are a pair and also belong to the same cluster, they are homologous clusters; if they are a pair but not in the same cluster, they are non-homologous clusters. Two micro-clusters satisfying the neighbor-cluster relationship are neighbor clusters of each other, and S_ni denotes the set of all neighbor clusters of micro-cluster c_i.

In the micro-cluster merging process, boundary points are deleted layer by layer, so the method of determining boundary points is explained here. Boundary points are determined by the boundary degree of a data point: the larger the boundary degree, the more likely the point is a boundary point. For any data point x_p, the larger the distances from x_p to its k nearest neighbors are relative to the distances from x_p's neighbors to their own neighbors, the larger the boundary degree of x_p. Taking the weighted Euclidean distance as the distance between two data points, the boundary degree of a data point is computed as:
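The formula image is missing here as well; the same plausible reconstruction as in the Summary (the precise form is an assumption) is:

$$bp(x_p)=\frac{\bar{d}(x_p)}{\dfrac{1}{|kNN(x_p)|}\sum_{x_q\in kNN(x_p)}\bar{d}(x_q)},\qquad \bar{d}(x)=\frac{1}{|kNN(x)|}\sum_{x_q\in kNN(x)}d_w(x,x_q)$$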
where \bar{d}(x) is the mean weighted distance from x to its k nearest neighbors, as defined above.

A set proportion of the total number of data points is taken as the number of boundary points, and the corresponding number of points with the largest boundary degree are selected as boundary points.

The specific method of merging micro-clusters is as follows. Take c_i as an example: it can form neighbor-cluster relationships with multiple micro-clusters, i.e., c_i has multiple neighbor clusters. Among these, some form homologous clusters with c_i and some form non-homologous clusters with c_i. The main goal is to find, among the neighbor clusters of c_i, a micro-cluster that can form a homologous cluster with c_i, and then merge c_i with that micro-cluster. In fact, in regions far from the cluster core (denoted X_1), micro-clusters readily form non-homologous clusters, while in regions close to the cluster core (denoted X_2), micro-clusters readily form homologous clusters. Homologous clusters are therefore searched for differently in X_1 and X_2. Assuming X_1 consists of multiple layers of boundary points, the three steps of micro-cluster merging are as follows.

1) Delete one layer of boundary points before determining whether micro-clusters form neighbor-cluster relationships.

In many data sets, clusters are closely connected, so micro-clusters on cluster boundaries easily form neighbor-cluster relationships with one another. Once two micro-clusters in different clusters form a neighbor-cluster relationship, they may further be taken as homologous clusters. Therefore, one layer of boundary points is deleted before determining neighbor-cluster relationships (the data points with the highest boundary degree are taken as boundary points according to a set ratio, generally between 10% and 50% of the total number of data points). In this way, in the boundary regions of clusters, neighbor-cluster relationships that previously held between two micro-clusters no longer hold, so naturally those micro-clusters cannot form homologous clusters.

2) Find homologous clusters.

Find the homologous clusters in X_1. The purpose of this stage is to find the homologous clusters of all micro-clusters in X_1. Suppose a certain layer of X_1 has M boundary points (X_1 consists of layer upon layer of boundary points), and some of these M boundary points are local center points; let lc_j denote one of them, let c_i denote the micro-cluster corresponding to lc_j, and let c_j denote one of the neighbor clusters of c_i. Before these M boundary points are deleted, the method must determine which micro-cluster is the homologous cluster of c_i. As noted above, the closer a micro-cluster is to the cluster center, the more likely it is to form a homologous cluster, and the closer it is to the cluster center, the fewer boundary points it contains. Let |c_j| denote the number of data points in c_j in the initial state, and |c_jt| the number of data points in c_j before the M boundary points are deleted (where t layers of boundary points have already been deleted). The larger the ratio |c_jt|/|c_j|, the closer c_j is to the cluster center, and the more likely c_j and c_i are to form a homologous cluster. In addition, the closer the local center points of two micro-clusters are, the more likely the two micro-clusters are in the same cluster. The combination degree expresses the likelihood that two micro-clusters are homologous: the larger its value, the more likely they are homologous clusters. The combination degree is computed as follows:
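The formula image is missing here as well; the same plausible reconstruction as in the Summary (the exact form is an assumption) is:

$$combine(c_i,c_j)=\frac{|c_{jt}|/|c_j|}{d_w(lc_i,lc_j)}$$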
where d_w(lc_i, lc_j) denotes the weighted Euclidean distance between the two local center points.

Among the neighbor clusters of c_i, if c_j has the highest combination degree with c_i, then c_j is the homologous cluster of c_i. After the homologous clusters of all local center points contained in the M boundary points have been found, the M boundary points are deleted. If the M boundary points contain no local center points, they are deleted directly and the next layer of boundary points is processed; the process stops only after boundary points have been deleted the specified number of times. If c_i has no neighbor clusters, then c_i is taken as its own homologous cluster.

Find the homologous clusters in X_2. The boundary points deleted above are the data points in X_1; the remaining points of the data set are the points in X_2. Let lc_i denote a local center point in X_2 and c_i the corresponding micro-cluster. As noted above, micro-clusters in X_2 readily form homologous clusters, so every neighbor cluster of c_i that still contains data points is considered a homologous cluster of c_i. It suffices to find the homologous clusters of the micro-clusters of all local center points in X_2.

Then, for all homologous clusters in X_1 and X_2, the two micro-clusters in each homologous pair are merged into one cluster.

3) Reassign the data points in anomalous clusters.

After the micro-cluster merging of steps 1) and 2), if the number of data points in a cluster is less than a set proportion of the total number of data points, the cluster is considered anomalous; otherwise it is non-anomalous. The set proportion is 0%-5%, and this embodiment uses 1%. Anomalous clusters arise either because their points are themselves anomalies or because those points are relatively loosely connected to the core of a cluster. Either way, the data points in anomalous clusters are reclassified: each data point in an anomalous cluster is assigned to the cluster of its nearest data point (this cluster must be non-anomalous; otherwise, continue to the cluster of the next nearest data point). At this point, the entire micro-cluster merging process ends and the final image clustering result is obtained.

The algorithm adopted by the present invention for the image clustering process is named "The direction of the union between microclusters is determined by the passing of boundary points" (DUMDPBP). Its steps can be summarized as follows:
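The step summary referenced here is not reproduced in the text. The following pseudocode-style Python sketch restates the procedure described above; every helper function is a hypothetical stand-in named for a step in the text (only `local_centers_and_microclusters` is sketched earlier), not an implementation from the patent.

```python
def dumdpbp(images, k, n1, n2, n3):
    """Outline of the DUMDPBP pipeline described above (pseudocode).
    k: neighbor count; n1: merge rounds; n2, n3: boundary-point ratios."""
    X = minmax_normalize(pca_reduce(images, var_kept=0.90))   # step 1: preprocessing
    dist = pearson_weighted_distances(X)                      # d_w(x_p, x_q)
    ng = mutual_neighborhoods(dist, k)                        # NG(x_p)
    rho = mnd_densities(dist, ng)                             # MND + mean-distance density
    labels, centers = local_centers_and_microclusters(rho, ng, dist)
    centers = move_centers_to_means(X, labels, centers)       # stabilize micro-clusters
    delete_boundary_layer(dist, labels, ratio=n2)             # step 3: relations without boundary
    for _ in range(n1):                                       # step 4: peel X1 layer by layer
        merge_highest_combination_degree(labels, centers, dist)
        delete_boundary_layer(dist, labels, ratio=n3)
    merge_surviving_neighbor_clusters(labels, centers)        # homologous clusters in X2
    reassign_anomalous_clusters(labels, dist, max_frac=0.01)  # final cleanup
    return labels
```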
Device Embodiment

The device proposed in this embodiment, as shown in Fig. 5, includes a processor and a memory; the memory stores a computer program that can run on the processor, and the processor implements the method of the above method embodiment when executing the computer program.

That is, the method in the above method embodiment should be understood as a flow of the image data clustering method implementable by computer program instructions. These computer program instructions can be provided to a processor, so that execution of the instructions by the processor produces the functions specified in the above method flow.

The processor referred to in this embodiment is a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA).

The memory referred to in this embodiment includes a physical device for storing information; usually the information is digitized and then stored in an electrical, magnetic, or optical medium. For example: memories that store information electrically, such as RAM and ROM; memories that store information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, bubble memories, and USB drives; and memories that store information optically, such as CDs and DVDs. Of course, there are other kinds of memory as well, such as quantum memory and graphene memory.

The device constituted by the above memory, processor, and computer program is realized in a computer by the processor executing the corresponding program instructions; the processor may run various operating systems, such as Windows, Linux, Android, and iOS.

As another embodiment, the device may further include a display for presenting the diagnosis results for the staff's reference.

Experimental Analysis

To better illustrate the effect of the present invention, the method adopted here is compared below with several existing clustering methods. The experiments were run on a Windows 7 64-bit operating system with Matlab 2016b, 32 GB of memory, and an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz. Three different types of 2-dimensional data sets and one high-dimensional image data set were used, as shown in Table 1, which describes the basic information of the data sets. Before the experiments, the data were normalized to eliminate the influence of differing value ranges across features on the experimental results.

Table 1

For the first three data sets, the clustering results of the DUMDPBP algorithm of the present invention are shown in Figures 3-a, 3-b, and 3-c. The results agree with the actual class labels of the corresponding data sets, showing that the DUMDPBP algorithm works well on the Pathbased, Flame, and Spiral data sets. Different shapes in these figures represent different clusters: the Pathbased data set is finally clustered into three classes, the Flame data set into two, and the Spiral data set into three.

The Olivetti faces data set consists of a large number of photographs rather than points in a two-dimensional plane, and its clusters number in the dozens, so its clustering result cannot be displayed with different shapes as for the synthetic data sets, nor is there another good way to illustrate a photo data set with so many clusters; evaluation indices are therefore used below to present its clustering results. Nevertheless, the cluster center points in the clustering results are shown here. Note that the Olivetti faces data set has 400 photographs; only the photographs of the first 100 subjects are shown. Figure 4-a shows the clustering result of the DUMDPBP algorithm. The white dot in the upper right corner of some pictures marks a selected local center point; as the figure shows, every row (i.e., every cluster) contains at least one white dot, meaning a local center point is selected for every cluster and no cluster is ignored. Figure 4-b shows the clustering result of the DPC algorithm, where the white dot in the upper right corner of a picture marks a cluster center selected by the algorithm. As the figure shows, no cluster center is selected for the 1st cluster (the photos in row 1), the 8th cluster (the photos in row 8), or the 10th cluster (the photos in row 10), which means the DPC algorithm necessarily confuses the data points of these three clusters with those of other clusters. In fact, the DPC algorithm assigned all the data points of the 3rd cluster and 8 data points of the 10th cluster to a single cluster, which is a very poor outcome. The comparison shows that the present invention has a considerable advantage over the DPC algorithm.

To further assess the quality of the DUMDPBP clustering results, the present invention uses three widely used evaluation indices to measure clustering performance: adjusted mutual information (AMI), adjusted Rand index (ARI), and the Fowlkes-Mallows index (FMI). AMI takes values in [-1, 1], ARI takes values in [-1, 1], and FMI takes values in [0, 1]; for all three, the larger the value, the better the clustering result matches the actual situation. Let U = {U_1, ..., U_R} and V = {V_1, ..., V_C} denote, respectively, the true partition and the clustering-result partition of the data set X = {x_1, x_2, ..., x_n}. Let H(U) denote the entropy of the original partition, H(V) the entropy of the clustering-result partition, MI(U, V) the mutual information between the original partition U and the clustering-result partition V, and E{·} the expected mutual information between them. Let a denote the number of point pairs that are in the same cluster both in the ground truth and in the clustering result; b the number of pairs in the same cluster in the ground truth but not in the clustering result; c the number of pairs not in the same cluster in the ground truth but in the same cluster in the clustering result; and d the number of pairs in the same cluster in neither. The three evaluation indices are then computed as follows.
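The formula images are not reproduced in the text; the standard definitions of the three indices, in the notation just introduced, are:

$$AMI(U,V)=\frac{MI(U,V)-E\{MI(U,V)\}}{\max\{H(U),H(V)\}-E\{MI(U,V)\}}$$

$$ARI=\frac{2(ad-bc)}{(a+b)(b+d)+(a+c)(c+d)}$$

$$FMI=\frac{a}{\sqrt{(a+b)(a+c)}}$$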
According to the above three indices, the algorithm of the present invention (DUMDPBP) is compared with seven existing clustering algorithms on the above four data sets; the comparison results are shown in Table 2.

Table 2

Table 3

In Table 2, the parameters of DUMDPBP on the four data sets are as listed in Table 3: the number of neighbors of a data point (k), the number of times homologous clusters are searched for (N1), the proportion of boundary points to the total number of data points when determining neighbor-cluster relationships (N2), and the proportion of boundary points to the remaining data points when searching for homologous clusters (N3). For the Spiral data set, the clusters are clearly separated, so any two micro-clusters that form a neighbor-cluster relationship are homologous clusters; there is thus no need to search for homologous clusters, and N1 is set to 0. For the Flame data set, the gaps between clusters are not obvious, so N2 is set relatively large. The value of N3 is generally below 0.5. The clustering result of the DPC algorithm on the Olivetti faces data set uses the tuning parameter d_c = 1.1125. All other parameters are tuned following the paper by Rui Liu et al. in the journal Information Sciences: "Shared-nearest-neighbor-based clustering by fast search and find of density peaks".

As Table 2 shows, for a data set of arbitrary shape such as Spiral, DUMDPBP and five other clustering algorithms all score 1 on the three indices, meaning the clustering results of these six algorithms match the ground truth exactly. For a data set such as Flame, where there is no obvious separation between clusters, DUMDPBP, FKNN-DPC, and DPC all reach the upper bound of 1 on the three indices, indicating results fully consistent with the ground truth. For the Pathbased data set, only the DUMDPBP algorithm scores 1 on all three indices, while the other algorithms perform relatively poorly. For the Olivetti faces data set, DUMDPBP outperforms the other algorithms on all three indices. The algorithm of the present invention therefore clusters well not only on synthetic data sets but also on real-world image data sets; it is applicable to many types of data sets and has very good application prospects.

The present invention combines the mutual neighbor degree and the weighted Euclidean distance to propose a new density computation that avoids manually setting a cutoff distance. It then proposes the self-recommendation strategy for local center points and assigns the remaining points, so that even relatively low-density clusters obtain local center points and are not ignored. Finally, in the process of deleting boundary points layer by layer, it finds homologous clusters and merges them into clusters, completing the clustering task. Experiments show that the proposed algorithm achieves good results on data sets of different distribution types and has a wide range of applications.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010260470.7A CN111626321B (en) | 2020-04-03 | 2020-04-03 | Image data clustering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010260470.7A CN111626321B (en) | 2020-04-03 | 2020-04-03 | Image data clustering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111626321A true CN111626321A (en) | 2020-09-04 |
CN111626321B CN111626321B (en) | 2023-06-06 |
Family
ID=72271810
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010260470.7A Active CN111626321B (en) | 2020-04-03 | 2020-04-03 | Image data clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111626321B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000028441A2 (en) * | 1998-11-11 | 2000-05-18 | Microsoft Corporation | A density-based indexing method for efficient execution of high-dimensional nearest-neighbor queries on large databases |
CN105139035A (en) * | 2015-08-31 | 2015-12-09 | 浙江工业大学 | Mixed attribute data flow clustering method for automatically determining clustering center based on density |
CN109409400A (en) * | 2018-08-28 | 2019-03-01 | 西安电子科技大学 | Merge density peaks clustering method, image segmentation system based on k nearest neighbor and multiclass |
CN110929758A (en) * | 2019-10-24 | 2020-03-27 | 河海大学 | A Clustering Algorithm for Fast Searching and Finding Density Peaks for Complex Data |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112907602A (en) * | 2021-01-28 | 2021-06-04 | 中北大学 | Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm |
CN113344128A (en) * | 2021-06-30 | 2021-09-03 | 福建师范大学 | Micro-cluster-based industrial Internet of things adaptive stream clustering method and device |
CN113344128B (en) * | 2021-06-30 | 2023-06-23 | 福建师范大学 | A micro-cluster-based adaptive flow clustering method and device for industrial internet of things |
CN115858002A (en) * | 2023-02-06 | 2023-03-28 | 湖南大学 | Binary code similarity detection method and system based on graph comparison learning and storage medium |
CN115858002B (en) * | 2023-02-06 | 2023-04-25 | 湖南大学 | Binary code similarity detection method and system based on graph comparison learning and storage medium |
CN117152543A (en) * | 2023-10-30 | 2023-12-01 | 山东浪潮科学研究院有限公司 | An image classification method, device, equipment and storage medium |
CN117152543B (en) * | 2023-10-30 | 2024-06-07 | 山东浪潮科学研究院有限公司 | Image classification method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111626321B (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111626321A (en) | Image data clustering method and device | |
CN103839261B (en) | SAR image segmentation method based on decomposition evolution multi-objective optimization and FCM | |
CN106570873B (en) | A medical image segmentation method | |
CN106339416B (en) | A Clustering Method for Educational Data Based on Fast Grid Search for Density Peaks | |
Sun et al. | Density peaks clustering based on k-nearest neighbors and self-recommendation | |
CN102194133B (en) | Data-clustering-based adaptive image SIFT (Scale Invariant Feature Transform) feature matching method | |
CN107103336A (en) | A kind of mixed attributes data clustering method based on density peaks | |
CN106845536B (en) | Parallel clustering method based on image scaling | |
Zhou et al. | ECMdd: Evidential c-medoids clustering with multiple prototypes | |
CN107122851A (en) | A kind of lake water systems connects engineering proposal optimization model Sensitivity Analysis Method | |
CN107833224A (en) | A kind of image partition method based on multi-level region synthesis | |
CN107909111B (en) | Multi-level graph clustering partitioning method for residential area polygons | |
CN111310821A (en) | Multi-view feature fusion method, system, computer equipment and storage medium | |
CN104850867A (en) | Object identification method based on intuitive fuzzy c-means clustering | |
CN115546525A (en) | Multi-view clustering method and device, electronic equipment and storage medium | |
Li et al. | A new density peak clustering algorithm based on cluster fusion strategy | |
CN109948720A (en) | A Density-Based Hierarchical Clustering Method | |
Karbauskaitė et al. | Fractal-based methods as a technique for estimating the intrinsic dimensionality of high-dimensional data: a survey | |
CN109636809A (en) | A kind of image segmentation hierarchy selection method based on scale perception | |
CN113158817B (en) | An Objective Weather Typing Method Based on Fast Density Peak Clustering | |
CN111738516B (en) | Social network community discovery system through local distance and node rank optimization function | |
CN111738514B (en) | A social network community discovery method using local distance and node rank optimization functions | |
CN118229980A (en) | Image segmentation method and system based on attribute network random block model | |
CN110503138A (en) | A Multi-view Fuzzy Clustering Algorithm Based on Entropy and Distance Weighting | |
Yan et al. | Density-based Clustering using Automatic Density Peak Detection. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||