CN111626321A - Image data clustering method and device - Google Patents

Image data clustering method and device

Info

Publication number
CN111626321A
CN111626321A (application CN202010260470.7A)
Authority
CN
China
Prior art keywords: cluster, points, data, micro, point
Prior art date
Legal status: Granted
Application number
CN202010260470.7A
Other languages
Chinese (zh)
Other versions
CN111626321B (en)
Inventor
孙林
秦小营
孙全党
李文凤
Current Assignee
Henan Normal University
Original Assignee
Henan Normal University
Priority date
Filing date
Publication date
Application filed by Henan Normal University
Priority to CN202010260470.7A
Publication of CN111626321A
Application granted
Publication of CN111626321B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an image data clustering method and device, belonging to the technical field of image processing. The invention first proposes a self-recommendation strategy for determining local center points, which eliminates the subjectivity of manually selecting cluster centers and solves the problem that the center points of lower-density clusters are neglected; the remaining data points are then assigned with the local center points as micro-cluster centers, generating a plurality of micro-clusters; finally, a micro-cluster merging strategy is proposed in which a micro-cluster is merged with the micro-cluster containing fewer boundary points among its neighboring clusters, because such a micro-cluster is closer to the cluster core and is more likely to be in the same cluster; the clustering result of the invention is therefore more accurate.

Description

Image data clustering method and device
Technical Field
The invention relates to a method and a device for clustering image data, and belongs to the technical field of image processing.
Background
Clustering analysis is an unsupervised classification method. The method aims to divide a data set without classification labels into a plurality of clusters, ensures that objects in the clusters are similar to each other and objects between the clusters are dissimilar, and can be applied to the fields of optimization analysis, image segmentation, bioinformatics and the like.
Image data can be classified by clustering. The density peak clustering algorithm (DPC), proposed by Alex Rodriguez and Alessandro Laio, has been used more and more widely in recent years because of its effectiveness, novelty and simplicity. DPC is based on two assumptions: (1) a cluster center is surrounded by neighbor points of lower density than the center; (2) the distances between cluster centers are relatively large. Two concepts follow: (1) the local density of a data point x_i, denoted ρ_i; (2) the distance from x_i to the nearest point of higher density, denoted δ_i. On the basis of these two assumptions, the authors of the DPC algorithm propose the general idea of the algorithm: first select the data points with larger ρ_i and δ_i values as cluster centers, then assign each non-center point to the cluster of the nearest data point of higher density. Although DPC performs well in many respects, it still has certain defects: the choice of the cut-off distance d_c affects the accuracy of the clustering, i.e. a small change in d_c influences the density of the data points and the selection of the cluster centers, and hence the final clustering result; the cluster centers must be selected manually, which introduces subjective factors and affects the objectivity of the final clustering result. Moreover, for data sets with large density differences between clusters, DPC tends to ignore the cluster centers of the less dense clusters.
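To make the two DPC quantities concrete, the following minimal Python sketch (not part of the patent; the variable names and the cut-off kernel are illustrative) computes ρ and δ from a pairwise distance matrix:

    import numpy as np

    def dpc_rho_delta(D, dc):
        """Compute the DPC local density rho (cut-off kernel) and delta,
        the distance to the nearest point of higher density."""
        n = D.shape[0]
        rho = (D < dc).sum(axis=1) - 1        # -1 removes the point itself
        delta = np.empty(n)
        for i in range(n):
            denser = np.where(rho > rho[i])[0]
            if denser.size == 0:              # the globally densest point
                delta[i] = D[i].max()
            else:
                delta[i] = D[i, denser].min() # nearest denser point
        return rho, delta

Points with large ρ and large δ would then be picked as cluster centers, which is exactly the manual selection step that the self-recommendation strategy below replaces.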
On this basis, many scholars have made corresponding improvements to the density peak clustering algorithm. For example, a density peak clustering algorithm based on the density ratio (R-DPC) introduces the density ratio into DPC; by calculating the density ratio of the sample data, the recognition of lower-density clusters in the data is improved, and the accuracy of the overall clustering is further improved. An improved density peak clustering algorithm combined with k-nearest neighbors provides a new local density measure to describe the spatial distribution of each sample, effectively improving the clustering quality. Xie et al. propose a density peak search and point-assignment algorithm based on the fuzzy weighted k-nearest-neighbor technique to solve the inconsistency of the density measures of data points in the DPC algorithm. These improvements raise the performance of density peak clustering to a certain extent, but they still have difficulty handling data sets of variable density and have high time complexity.
Disclosure of Invention
The invention aims to provide a method and a device for clustering image data, which are used for solving the problem of poor image data clustering effect at present.
The present invention provides a method for clustering image data to solve the above technical problems, the method comprising the steps of:
1) acquiring image data to be clustered, performing dimension reduction processing on the image data, taking each image as a data point, and determining the density of each data point;
2) for each data point, judging whether the density of the data point is greater than that of the data points in the neighborhood, if so, recommending the data point as a local central point, taking the local central point as a center, distributing the rest data points to the local central point, and generating a micro-cluster;
3) determining the number of boundary points according to a first set proportion, selecting a corresponding number of points with larger boundary degree as the boundary points, and determining whether an adjacent cluster relation is formed between two micro clusters under the condition of not considering the boundary points;
4) determining boundary points according to a second set proportion, judging whether the determined boundary points contain local central points, if so, selecting the micro-cluster with the highest combination degree from the adjacent clusters of the micro-cluster where the local central points are located for combination, and deleting the boundary points;
and continuously determining boundary points from the rest points again according to a second set proportion to obtain a next layer of boundary points, judging whether the next layer of boundary points contain local central points, if so, selecting the micro-cluster with the highest combination degree from the adjacent clusters of the micro-cluster with the local central points to combine, deleting the layer of boundary points until the boundary points with the set times are deleted, and then combining the micro-cluster with the adjacent clusters if data points exist in the adjacent clusters.
The invention also provides a clustering device of image data, which comprises a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor is coupled with the memory, and the processor executes the computer program to realize the clustering method of the image data.
According to the image clustering method, the local center points are selected by a self-recommendation strategy, which eliminates the subjectivity of manually selecting cluster centers and solves the problem that the center points of low-density clusters are ignored; micro-clusters are then generated with the local center points as their centers; finally, a micro-cluster merging method is provided in which, while boundary points are deleted layer by layer, the micro-cluster with the highest degree of combination is selected from the neighboring clusters of a micro-cluster for merging; since that micro-cluster is closer to the cluster core, it is more likely to be in the same cluster, so the final clustering result is more accurate.
Further, in order to solve the problem that the truncation distance is difficult to select when calculating the density, the density value of each point in step 1) is determined from the mutual adjacency of the point and the mean of the distances between the point and the points in its neighborhood.
Furthermore, the invention provides a formula for calculating the density of the data points; the density value is calculated as:

    ρ(x_p) = MND(x_p) / ( (1/|NG(x_p)|) · Σ_{x_q ∈ NG(x_p)} d_w(x_p, x_q) )

    MND(x_p) = Σ_{x_q ∈ NG(x_p)} 1 / (s_pq + s_qp)

where ρ(x_p) represents the density of data point x_p, MND(x_p) represents the mutual adjacency of x_p, NG(x_p) represents the neighborhood of x_p, |NG(x_p)| represents the number of data points in that neighborhood, and d_w(x_p, x_q) represents the weighted Euclidean distance from data point x_p to data point x_q. Among the k neighbors of x_p, if x_q is the l-th neighbor of x_p, then s_pq = l, and the larger its value, the farther x_q is from x_p; likewise, if x_p is the l-th neighbor of x_q, then s_qp = l.
Further, in order to reflect the internal relationship between data points, the distance between two data points is a weighted Euclidean distance, with the Pearson correlation coefficient of the two points used as the weight.
Further, in order to accurately determine whether a data point is a boundary point, the concept of the boundary degree of a data point is proposed; the larger the boundary degree, the more likely the data point is a boundary point. The boundary degree of a data point is calculated as:

    bp(x_p) = (1/|kNN(x_p)|) · Σ_{x_q ∈ kNN(x_p)} d_w(x_p, x_q)

    d_w(x_p, x_q) = Pear(x_p, x_q) · sqrt( Σ_{j=1}^m (x_pj - x_qj)² )

where bp(x_p) represents the boundary degree of data point x_p, kNN(x_p) represents the set of k nearest neighbors of x_p, |kNN(x_p)| represents the number of those neighbors, Pear(x_p, x_q) is the Pearson correlation coefficient between data points x_p and x_q, and m is the dimensionality of the data set.
Further, in order to accurately determine the relationship between micro-clusters, the criterion for judging neighboring micro-clusters is:

    R(c_i, c_j) holds if and only if ( ⋃_{x_p ∈ c_i} NG(x_p) ) ∩ c_j ≠ ∅

where NG(x_p) represents the neighborhood of data point x_p, c_i represents the i-th micro-cluster formed, and R(c_i, c_j) represents that micro-cluster c_i and micro-cluster c_j form a neighboring-cluster relationship.
Further, after the micro-clusters are formed, the method changes the local center point of each micro-cluster to the mean point of all data points in the micro-cluster; if the mean point does not exist, it is changed to the data point in the micro-cluster closest to the mean.
Further, the invention provides a formula for the degree of combination, which is calculated as:

    Com(c_i, c_j) = ( |c_jt| / |c_j| ) / d_w(lc_i, lc_j)

where |c_j| represents the number of data points in micro-cluster c_j in the initial state, |c_jt| represents the number of data points in c_j after the t-th layer of boundary points has been deleted, and d_w(lc_i, lc_j) represents the weighted Euclidean distance between the two local center points. The larger the value of |c_jt|/|c_j|, the closer c_j is to the cluster core region, and the higher the degree of combination of c_j and c_i.
Further, in order to avoid clustering some abnormal points separately, the method further includes a process of adjusting the clusters after the micro-clusters are merged:
after the homologous clusters are merged, it is judged whether the number of data points in a resulting cluster is less than or equal to 0-5% of the total number of data points; if so, the cluster is treated as an abnormal cluster, and each of its data points is assigned to the non-abnormal cluster containing the data point closest to it.
Drawings
FIG. 1 is a flow chart of an image data clustering method of the present invention;
fig. 2-a is a schematic diagram of the preliminary clustering result on the Pathbased data set (k = 7) in the embodiment of the present invention;
fig. 2-b is a schematic diagram of the preliminary clustering result on the Pathbased data set (k = 11) in the embodiment of the present invention;
fig. 2-c is a schematic diagram of the preliminary clustering result on the Flame data set (k = 7) in the embodiment of the present invention;
fig. 2-d is a schematic diagram of the preliminary clustering result on the Flame data set (k = 18) in the embodiment of the present invention;
fig. 2-e is a schematic diagram of the preliminary clustering result on the Spiral data set (k = 6) in the embodiment of the present invention;
fig. 2-f is a schematic diagram of the preliminary clustering result on the Spiral data set (k = 13) in the embodiment of the present invention;
FIG. 3-a is a schematic diagram of the final clustering result on the Pathbased data set in the embodiment of the present invention;
FIG. 3-b is a schematic diagram of the final clustering result on the Flame data set in the embodiment of the present invention;
FIG. 3-c is a schematic diagram of the final clustering result on the Spiral data set in the embodiment of the present invention;
FIG. 4-a is a schematic diagram of a final clustering result of Olivetti Faces data sets in the embodiment of the present invention;
FIG. 4-b is a diagram illustrating the final clustering result of Olivetti Faces data sets by DPC algorithm;
FIG. 5 is a block diagram showing the structure of an image data clustering apparatus according to the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
Method embodiment
Considering the poor clustering effect of the DPC algorithm and other clustering algorithms on image data, the invention provides a new image clustering method that makes two improvements on the basis of existing DPC image clustering: first, the density of data points is calculated using the weighted Euclidean distance and the mutual adjacency; second, local center points are determined automatically by a self-recommendation strategy, micro-clusters are formed based on the determined local center points, and a strategy is provided for merging each micro-cluster with the micro-cluster containing fewer boundary points among its neighboring clusters, yielding the final clustering result. The implementation flow is shown in fig. 1. The implementation of the method is described in detail below with a specific example.
1. Acquire the image data to be clustered and preprocess it.
The number of features of the acquired image data is generally large; for example, a 90 × 150 image has 13500 features. Therefore, dimension reduction must be performed on the image data; this embodiment uses the PCA algorithm. For the Olivetti Faces data set, PCA retains the principal components up to a cumulative contribution rate of 90%. To eliminate the influence of different value ranges among different features on the experimental results, the data also need to be normalized. The normalization uses the formula:
    x'_ij = (x_ij - min_j) / (max_j - min_j)

where x_ij is the value of the i-th data point on the j-th attribute, x'_ij is the corresponding normalized value, and min_j and max_j are the minimum and maximum of the j-th attribute over all data points.
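As a concrete illustration of this preprocessing step, the following Python sketch (assuming scikit-learn's PCA and the min-max normalization reconstructed above) flattens the images, keeps the principal components up to a 90% cumulative contribution rate, and normalizes each attribute:

    import numpy as np
    from sklearn.decomposition import PCA

    def preprocess_images(images):
        """images: array of shape (n_images, height, width)."""
        X = images.reshape(len(images), -1).astype(float)  # one row per image
        X = PCA(n_components=0.90).fit_transform(X)        # 90% cumulative contribution
        mn, mx = X.min(axis=0), X.max(axis=0)
        span = np.where(mx > mn, mx - mn, 1.0)             # guard constant attributes
        return (X - mn) / span                             # min-max normalization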
2. The density of each data point is determined.
For a given data set X_{n×m} = {x_1, x_2, ..., x_i, ..., x_n}, where n is the number of data points, m is the dimensionality of the data points, and x_i = [x_i1, x_i2, ..., x_im], the data point density in the existing DPC algorithm is calculated as:

    ρ_i = Σ_{j≠i} χ(d_ij - d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise    (1)

    ρ_i = Σ_{j≠i} exp( -d_ij² / d_c² )    (2)

where formula (1) is intended for data sets with a larger number of samples and formula (2) for data sets with a smaller number of samples. The d_c in formulas (1) and (2) is identical and represents the truncation distance; d_ij represents the Euclidean distance between two data points.
First, the weighted Euclidean distance between two data points is calculated, taking the Pearson correlation coefficient between them as the weight. The concept of mutual adjacency is then proposed to determine how closely a data point is related to the data points in its neighborhood: the higher the mutual adjacency, the more closely a data point is connected with the data points in its neighborhood, and the greater its density. Finally, the density of the data points is calculated by combining the weighted Euclidean distance and the mutual adjacency.
1) Calculate the weighted Euclidean distance between two data points.
In this embodiment, the Pearson correlation coefficient is used as an influence factor, and the distance between data points is defined as:

    d_w(x_p, x_q) = Pear(x_p, x_q) · sqrt( Σ_{j=1}^m (x_pj - x_qj)² )

    Pear(x_p, x_q) = Σ_{j=1}^m (x_pj - x̄_p)(x_qj - x̄_q) / ( sqrt(Σ_{j=1}^m (x_pj - x̄_p)²) · sqrt(Σ_{j=1}^m (x_qj - x̄_q)²) )

where d_w(x_p, x_q) represents the weighted Euclidean distance between data points x_p and x_q, Pear(x_p, x_q) represents the Pearson correlation coefficient between x_p and x_q, x_pj represents the value of x_p on its j-th attribute, x_qj represents the value of x_q on its j-th attribute, x̄_p and x̄_q are the means of the attribute values of x_p and x_q, and m is the dimension of the data set, i.e. the number of data point features.
2) Calculate the mutual adjacency of data points and the data point density.
Before defining mutual adjacency, the mutual-neighbor relation is explained first. If x_q is among the k nearest neighbors of x_p and x_p is also among the k nearest neighbors of x_q, then x_p and x_q form a mutual-neighbor relation. On this basis, the concept of the neighborhood of a data point is proposed. Taking x_p as an example, the neighborhood of x_p refers to the set of data points that are among the k nearest neighbors of x_p and form a mutual-neighbor relation with x_p, i.e. NG(x_p) = { x_q | x_q ∈ kNN(x_p) ∧ x_p ∈ kNN(x_q) }, where kNN(x_p) represents the set of the k data points closest to x_p and NG(x_p) represents the set of data points in the neighborhood of x_p. The invention proposes the concept of mutual adjacency, which reflects the data point density and is calculated as:

    MND(x_p) = Σ_{x_q ∈ NG(x_p)} 1 / (s_pq + s_qp)

The larger the value of MND(x_p), the greater the density of x_p.
The density of a data point is inversely proportional to the average distance to the data points in its neighborhood and directly proportional to its mutual adjacency. Combining the weighted Euclidean distance and the mutual adjacency, a new density calculation is proposed:

    ρ(x_p) = MND(x_p) / ( (1/|NG(x_p)|) · Σ_{x_q ∈ NG(x_p)} d_w(x_p, x_q) )

where |NG(x_p)| represents the number of data points in the neighborhood of x_p.
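The neighborhood, mutual adjacency and density can then be computed from a weighted-distance matrix. In the following sketch, the rank-sum form of MND is an assumption consistent with the definitions of s_pq and s_qp above, and a nonnegative distance matrix with a zero diagonal is assumed:

    import numpy as np

    def density(DW, k):
        """Return rho for every point; DW is an (n, n) weighted-distance matrix."""
        n = DW.shape[0]
        knn = np.argsort(DW, axis=1)[:, 1:k + 1]     # k nearest neighbors (skip self)
        rank = np.zeros((n, n), dtype=int)           # rank[p, q] = s_pq (0 if absent)
        for p in range(n):
            for l, q in enumerate(knn[p], start=1):
                rank[p, q] = l
        rho = np.zeros(n)
        for p in range(n):
            ng = [q for q in knn[p] if rank[q, p] > 0]          # NG(x_p)
            if ng:
                mnd = sum(1.0 / (rank[p, q] + rank[q, p]) for q in ng)
                rho[p] = mnd / np.mean([DW[p, q] for q in ng])  # MND / mean distance
        return rho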
3. The local center points are automatically selected using a self-recommended strategy to form micro-clusters.
For data sets with large density differences between clusters, DPC easily ignores the center points of the less dense clusters. Therefore, the invention provides a strategy for automatically selecting local center points: if the density of x_p is greater than the density of every point in its neighborhood, then x_p recommends itself as a local center point (the self-recommendation strategy for short); otherwise x_p is not recommended as a local center point. If there are c local center points in total, the set of all local center points in a data set is denoted by LC, i.e. LC = {lc_1, lc_2, ..., lc_i, ..., lc_j, ..., lc_c}.
With the self-recommendation strategy, points in clusters of relatively low density still have a good chance of being selected as local center points, so that clusters of relatively low density are not ignored.
This embodiment uses the DPC assignment method to assign the remaining data points, i.e., each remaining point is assigned to the cluster of the nearest point with higher density. After the assignment is completed, a number of "micro-clusters" are formed, completing the preliminary clustering task. The preliminary clustering result is denoted C = {c_1, c_2, ..., c_i, ..., c_j, ..., c_c} (since there are c local center points, the preliminary result has c micro-clusters), where c_i denotes the i-th micro-cluster. Since the dimensionality of the image data remains high even after dimension reduction, the preliminary clustering results are shown only on 3 widely used synthetic data sets in this embodiment: the Pathbased, Flame and Spiral data sets. The preliminary clustering results are shown in FIGS. 2-a through 2-f (different shapes represent different micro-clusters; if two non-adjacent micro-clusters have the same shape, they still represent different micro-clusters; black pentagons represent the local center points of the micro-clusters, and black numbers their cluster numbers). As other embodiments, besides the DPC assignment method, other existing clustering methods can be used to assign the remaining data points to the micro-cluster of the nearest local center point.
It can be seen from the figures that, for each data set, it is easy to find a suitable k value that yields a correct preliminary clustering result, as long as k is neither too large nor too small relative to the total number of data points in the data set. A correct preliminary clustering result means that the data points within each micro-cluster truly belong to the same cluster.
It should be noted that, in order to ensure that all local center points can form a stable cluster structure, after the remaining data points have been assigned, each local center point is changed to the mean point of all data points in its micro-cluster; if no such point exists, it is changed to the data point in the micro-cluster closest to the mean.
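A sketch of the self-recommendation and assignment steps follows (building on the quantities computed in the sketches above; the helper structures are illustrative, and NG is the list of mutual-neighborhood sets):

    import numpy as np

    def micro_clusters(DW, rho, NG):
        """Self-recommend local center points and assign the rest DPC-style."""
        n = len(rho)
        centers = [p for p in range(n)
                   if NG[p] and all(rho[p] > rho[q] for q in NG[p])]
        label = np.full(n, -1)
        for i, c in enumerate(centers):
            label[c] = i
        # each remaining point joins the cluster of its nearest denser point
        for p in sorted(range(n), key=lambda q: -rho[q]):
            if label[p] >= 0:
                continue
            denser = [q for q in range(n) if rho[q] > rho[p]]
            if not denser:                 # isolated global-density maximum
                label[p] = 0
                continue
            label[p] = label[min(denser, key=lambda q: DW[p, q])]
        return centers, label

The recentering described in the note above (moving each local center point to the mean point of its micro-cluster) would follow this assignment.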
4. Merge the micro-clusters.
In the present invention, a data set is composed of a plurality of clusters, and a cluster is composed of a plurality of micro-clusters. Therefore, micro-clusters need to be merged into clusters; before merging the micro-clusters, the neighboring-cluster relationship is introduced.
If the union of the neighborhoods of all data points in micro-cluster c_i has a non-empty intersection with the members of micro-cluster c_j, then c_j and c_i are considered to form a neighboring-cluster relationship, denoted R(c_i, c_j):

    R(c_i, c_j) holds if and only if ( ⋃_{x_p ∈ c_i} NG(x_p) ) ∩ c_j ≠ ∅

If two micro-clusters satisfy the neighboring-cluster relationship, they are called paired clusters. Two micro-clusters are homologous clusters if they are paired clusters and also belong to the same cluster; if they are paired clusters but not in the same cluster, they are non-homologous clusters. If two micro-clusters satisfy the neighboring-cluster relationship, they are neighbors of each other, and Sn_i denotes the set of all neighboring clusters of micro-cluster c_i.
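In code, the neighboring-cluster test reduces to a set intersection; a sketch under the same assumed structures:

    def are_neighbors(ci_points, cj_points, NG):
        """R(c_i, c_j): does the union of the mutual neighborhoods of the
        points of c_i intersect the members of c_j?"""
        reach = set()
        for p in ci_points:
            reach |= set(NG[p])
        return bool(reach & cj_points)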
In the course of micro-cluster merging, boundary points must be deleted layer by layer; therefore, the method of determining boundary points is described here. The basis for determining boundary points is the boundary degree of a data point: the larger the boundary degree, the more likely the point is a boundary point. For any data point x_p, the larger the distances between x_p and its k nearest neighbors, the larger the boundary degree of x_p. Using the weighted Euclidean distance as the distance between two data points, the boundary degree of a data point is calculated as:

    bp(x_p) = (1/|kNN(x_p)|) · Σ_{x_q ∈ kNN(x_p)} d_w(x_p, x_q)

The number of boundary points is taken as a set proportion of the total number of data points, and the corresponding number of points with the largest boundary degree are selected as boundary points.
The specific procedure for merging micro-clusters is as follows. Taking c_i as an example, it may form neighboring-cluster relations with several micro-clusters, i.e., c_i has several neighboring clusters. Among these, some form homologous clusters with c_i and some form non-homologous clusters. The main purpose is to find, among the neighboring clusters of c_i, the micro-clusters that form homologous clusters with c_i, and then merge them with c_i. In fact, in the region far from the cluster core (denoted X_1), micro-clusters more easily form non-homologous clusters, while in the region near the cluster core (denoted X_2), micro-clusters very easily form homologous clusters. Therefore, homologous clusters are sought in different ways in X_1 and X_2. Suppose X_1 consists of multiple layers of boundary points; the three steps of micro-cluster merging are as follows.
1) After deleting a layer of boundary points, determine the neighboring-cluster relations among all micro-clusters.
For many data sets, clusters are tightly connected to each other, so micro-clusters located on cluster boundaries easily form neighboring-cluster relations, and once two micro-clusters in different clusters form a neighboring-cluster relation, they may further form a homologous cluster. Therefore, a layer of boundary points is deleted before the neighboring-cluster relations are determined (the data points with the highest boundary degree are taken as boundary points according to a set proportion, generally between 10% and 50% of the total number of data points), so that a neighboring-cluster relation that would otherwise be established between two micro-clusters in the boundary region of different clusters no longer holds, and naturally the two micro-clusters cannot form a homologous cluster.
2) Search for homologous clusters.
Find the homologous clusters in X_1. The purpose of this stage is to find the homologous clusters of all micro-clusters in X_1. Suppose the current layer of X_1 has M boundary points (X_1 is composed of layer-by-layer boundary points), and some of these M boundary points are local center points; let lc_j denote one of them, let c_i denote the micro-cluster corresponding to lc_j, and let c_j denote one of the neighboring clusters of c_i. Before these M boundary points are deleted, it must be determined which neighboring cluster is the homologous cluster of c_i. As mentioned above, the closer a micro-cluster is to the cluster center, the more likely it is to form a homologous cluster, and the closer a micro-cluster is to the cluster center, the fewer boundary points it contains. Let |c_j| denote the number of data points in micro-cluster c_j in the initial state, and let |c_jt| denote the number of data points in c_j before the M boundary points are deleted (t indicating that boundary points have already been deleted t times). Then the larger the value of |c_jt|/|c_j|, the closer c_j is to the cluster center, and the more likely c_j and c_i are to form a homologous cluster. Furthermore, the closer the local center points of two micro-clusters are, the more likely the two micro-clusters are in the same cluster. The probability that two micro-clusters are homologous clusters is expressed by the degree of combination: the larger its value, the more likely the two micro-clusters are homologous. The degree of combination is calculated as:

    Com(c_i, c_j) = ( |c_jt| / |c_j| ) / d_w(lc_i, lc_j)

where d_w(lc_i, lc_j) represents the weighted Euclidean distance between the two local center points.
Among the neighboring clusters of c_i, if c_j has the highest degree of combination with c_i, then c_j is the homologous cluster of c_i. After the homologous clusters of all local center points contained in the M boundary points have been found, the M boundary points are deleted. If there is no local center point among the M boundary points, they are deleted directly and the next layer of boundary points is processed; the process stops once the boundary points have been deleted the specified number of times. If c_i has no neighboring cluster, then c_i is considered to form a homologous cluster with itself.
Find the homologous clusters in X_2. The boundary points deleted above are the data points of X_1; the points remaining in the data set are the points of X_2. Let lc_i denote a local center point in X_2 and c_i the micro-cluster corresponding to it. As mentioned above, the micro-clusters in X_2 are very likely to form homologous clusters. Therefore, every neighboring cluster of c_i that still has data points remaining is considered a homologous cluster of c_i. In this way, the homologous clusters of the micro-clusters of all local center points in X_2 are found.
Then, for both X_1 and X_2, every two micro-clusters in a homologous-cluster relationship are merged into one cluster.
3) Re-assign the data points in abnormal clusters.
After the micro-clusters have been merged in step 1) and step 2), if the number of data points in a cluster is less than a set proportion of the total number of data points, the cluster is considered an abnormal cluster; otherwise it is a non-abnormal cluster. The set proportion is 0%-5%, and 1% is chosen in this embodiment. An abnormal cluster arises either because its points are themselves abnormal points or because its points are relatively loosely related to the core part of a cluster. In either case, the data points in the abnormal cluster are re-assigned: each data point in an abnormal cluster is assigned to the cluster of the data point closest to it (this cluster must be a non-abnormal cluster; otherwise, the cluster of the next closest data point is tried, and so on). This ends the whole micro-cluster merging process and yields the final image clustering result.
The algorithm adopted in the image clustering process is named DUMDPBP (the merging direction between micro-clusters is determined by the passing of boundary points). The steps of the algorithm can be summarized as follows:
1) Preprocess the image data (PCA dimension reduction and normalization), and compute the weighted Euclidean distances, the neighborhoods NG, the mutual adjacency MND and the density ρ of all data points;
2) self-recommend the local center points, assign the remaining points to form micro-clusters, and move each local center point to the mean point of its micro-cluster;
3) delete one layer of boundary points and determine the neighboring-cluster relations among the micro-clusters;
4) delete boundary points layer by layer; for each layer containing local center points, merge the corresponding micro-cluster with the neighboring cluster of highest degree of combination; after the specified number of layers, merge each remaining micro-cluster with its neighboring clusters that still contain data points;
5) re-assign the data points of abnormal clusters and output the final clustering result.
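Since the algorithm table is reproduced only as an image in the original, the following Python sketch shows the merging bookkeeping under the reconstructions above. All inputs (label, center_point, bp, neighbors) are the illustrative structures from the earlier sketches, and the degree-of-combination formula is the reconstructed one, so this is an outline of the described procedure rather than the patent's verbatim algorithm:

    import numpy as np

    class DisjointSet:
        """Union-find over micro-cluster ids."""
        def __init__(self, n):
            self.parent = list(range(n))
        def find(self, a):
            while self.parent[a] != a:
                self.parent[a] = self.parent[self.parent[a]]
                a = self.parent[a]
            return a
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    def merge_micro_clusters(label, center_point, centers_xy, bp, neighbors,
                             n_layers, ratio):
        """label: micro-cluster id of every point; center_point: point index of
        each local center; centers_xy: coordinates of the local centers;
        bp: boundary degree of every point; neighbors: dict id -> set of
        neighboring micro-cluster ids (determined after the first layer of
        boundary points was removed)."""
        c = len(center_point)
        size0 = np.bincount(label, minlength=c)              # |c_j| initially
        alive = np.ones(len(label), dtype=bool)
        ds = DisjointSet(c)
        for _ in range(n_layers):                            # X1: layer by layer
            cand = np.where(alive)[0]
            layer = cand[np.argsort(-bp[cand])[:int(len(cand) * ratio)]]
            size_t = np.bincount(label[alive], minlength=c)  # |c_jt|
            for i in range(c):                               # centers in this layer
                if center_point[i] in layer and neighbors.get(i):
                    best = max(neighbors[i],                 # highest combination
                               key=lambda j: (size_t[j] / size0[j]) /
                                   np.linalg.norm(centers_xy[i] - centers_xy[j]))
                    ds.union(i, best)
            alive[layer] = False
        for i in range(c):                                   # X2: neighbors that
            for j in neighbors.get(i, ()):                   # still have points left
                if np.any(alive & (label == j)):
                    ds.union(i, j)
        return np.array([ds.find(int(l)) for l in label])

The re-assignment of abnormal clusters (step 3 above) would then dissolve any merged cluster holding no more than the set proportion of all points.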
device embodiment
The apparatus proposed in this embodiment, as shown in fig. 5, includes a processor and a memory, where a computer program operable on the processor is stored in the memory, and the processor implements the method of the above method embodiment when executing the computer program.
That is, the method in the above method embodiment should be understood as follows: the flow of the image data clustering method may be implemented by computer program instructions, which may be provided to a processor such that execution of the instructions by the processor implements the functions specified in the method flow described above.
The processor referred to in this embodiment refers to a processing device such as a microprocessor MCU or a programmable logic device FPGA;
the memory referred to in this embodiment includes a physical device for storing information, and generally, information is digitized and then stored in a medium using an electric, magnetic, optical, or the like. For example: various memories for storing information by using an electric energy mode, such as RAM, ROM and the like; various memories for storing information by magnetic energy, such as hard disk, floppy disk, magnetic tape, magnetic core memory, bubble memory, and U disk; various types of memory, CD or DVD, that store information optically. Of course, there are other ways of memory, such as quantum memory, graphene memory, and so forth.
The apparatus comprising the memory, the processor and the computer program is realized by the processor executing the corresponding program instructions in the computer, and the processor can run various operating systems, such as Windows, Linux, Android and iOS.
As other embodiments, the device can also comprise a display for showing the clustering result for the reference of workers.
Analysis of experiments
To better illustrate the effect of the present invention, the method of the invention is compared with several existing clustering methods. The experiments are performed on a machine with a 64-bit Windows operating system, Matlab 2016b, 32GB of memory and an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. Experiments use 3 different types of 2-dimensional data sets and one high-dimensional image data set, as shown in Table 1, which gives the basic information of the data sets. Before the experiments begin, the data are normalized to eliminate the influence of differing value ranges on the experimental results.
TABLE 1
(Table 1 is reproduced as an image in the original; it lists the basic information of the Pathbased, Flame, Spiral and Olivetti Faces data sets.)
For the first three data sets, the clustering results of the DUMDPBP algorithm of the invention are shown in FIG. 3-a, FIG. 3-b and FIG. 3-c; the clustering results are consistent with the actual classification of the corresponding data sets, which shows that the DUMDPBP algorithm works well on the Pathbased, Flame and Spiral data sets. Different shapes in these figures represent different clusters; the final clustering yields three clusters for the Pathbased data set, two for the Flame data set and three for the Spiral data set.
For the Olivetti Faces data set, since it consists of a large number of photographs rather than points in a two-dimensional plane, and the number of clusters runs to several tens, the clustering result cannot be shown with different shapes as for the synthetic data sets, nor is there another good way to visualize a photographic data set with so many clusters; in the subsequent section the clustering result is therefore reported through the evaluation indexes. Even so, we show the cluster center points in the clustering results. It should be noted that the Olivetti Faces data set has 400 photos, and only the first 100 photos are selected for presentation. FIG. 4-a shows the clustering result of the DUMDPBP algorithm: the white dots in the upper right corner of some pictures represent the selected local center points, and at least one white dot appears in each row (i.e., each cluster); that is, a local center point is selected in every cluster, so no cluster is ignored. FIG. 4-b shows the clustering result of the DPC algorithm: the white dots in the upper right corner represent the cluster center points selected by DPC, and no cluster center point is selected in the 1st cluster (row 1), the 8th cluster (row 8) or the 10th cluster (row 10); that is, the DPC algorithm will inevitably confuse the data points of these three clusters with those of other clusters. In fact, the DPC algorithm assigns all data points of the 3rd cluster and 8 data points of the 10th cluster to one cluster, which is a poor result. The comparison shows that the method of the invention has great advantages over the DPC algorithm.
To further assess the quality of the clustering results of the DUMDPBP algorithm, three commonly used evaluation indexes are adopted to measure the performance of the clustering algorithms: adjusted mutual information (AMI), adjusted Rand index (ARI), and the Fowlkes-Mallows index (FMI). AMI ranges over [-1, 1], ARI ranges over [-1, 1], and FMI ranges over [0, 1]; for all three, the larger the index value, the more consistent the clustering result is with the ground truth. Let U = {U_1, ..., U_R} and V = {V_1, ..., V_C} denote, respectively, the true partition and the clustering-result partition of the data set X = {x_1, x_2, ..., x_n}. Let H(U) denote the entropy of the true partition, H(V) the entropy of the clustering-result partition, MI(U, V) the mutual information between U and V, and E{MI(U, V)} the expected mutual information between U and V. Let a denote the number of point pairs that are in the same cluster both in the ground truth and in the clustering result; b the number of pairs in the same cluster in the ground truth but not in the clustering result; c the number of pairs not in the same cluster in the ground truth but in the same cluster in the clustering result; and d the number of pairs in the same cluster in neither. The three evaluation indexes are calculated as follows:

    AMI(U, V) = ( MI(U, V) - E{MI(U, V)} ) / ( max(H(U), H(V)) - E{MI(U, V)} )

    ARI = 2(ad - bc) / ( (a+b)(b+d) + (a+c)(c+d) )

    FMI = a / sqrt( (a+b)(a+c) )
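All three indexes are available in scikit-learn, so the evaluation can be reproduced as follows (a sketch; labels_true and labels_pred are the ground-truth and predicted cluster labels):

    from sklearn.metrics import (adjusted_mutual_info_score,
                                 adjusted_rand_score,
                                 fowlkes_mallows_score)

    def evaluate(labels_true, labels_pred):
        """Return the AMI, ARI and FMI of a clustering result."""
        return (adjusted_mutual_info_score(labels_true, labels_pred),
                adjusted_rand_score(labels_true, labels_pred),
                fowlkes_mallows_score(labels_true, labels_pred))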
The algorithm of the invention (DUMDPBP) is compared with 7 existing clustering algorithms on the above 4 data sets according to the three indexes; the comparison results are shown in Table 2.
TABLE 2
(Table 2 is reproduced as an image in the original; it lists the AMI, ARI and FMI values of DUMDPBP and the comparison algorithms on the four data sets.)
TABLE 3
(Table 3 is reproduced as an image in the original; it lists the parameter settings of DUMDPBP on the four data sets.)
The parameters of DUMDPBP on the 4 data sets of Table 2 are shown in Table 3, including the number of neighbors of a data point (k), the number of times homologous clusters are sought (N1), the proportion of boundary points to the total number of data points when determining the neighboring-cluster relations (N2), and the proportion of boundary points to the total number of remaining data points when seeking homologous clusters (N3). For the Spiral data set, because of the obvious gaps between clusters, two micro-clusters are homologous as soon as they form a neighboring-cluster relation, so no search for homologous clusters is needed and N1 is set to 0. For the Flame data set, because the gaps between clusters are not obvious, N2 is set relatively large. The value of N3 is usually less than 0.5. The clustering result of the DPC algorithm on the Olivetti Faces data set uses the adjusted parameter d_c = 1.1125. The adjustment of the remaining parameters follows the article published by Rui Liu et al. in the journal Information Sciences: "Shared-nearest-neighbor-based clustering by fast search and find of density peaks".
As can be seen from Table 2, for a data set of arbitrary shape such as Spiral, DUMDPBP and 5 other clustering algorithms attain the value 1 on all three indexes, indicating that the clustering results of these six algorithms are identical to the ground truth. For a data set without obvious separation between clusters such as Flame, DUMDPBP, FKNN-DPC and DPC attain the upper limit 1 on all three indexes, indicating that their clustering results are completely consistent with the ground truth. For the Pathbased data set, only the DUMDPBP algorithm attains 1 on all three indexes, while the other algorithms perform relatively poorly. For the Olivetti Faces data set, the DUMDPBP algorithm outperforms the other algorithms on all three indexes. Therefore, the algorithm of the invention achieves good clustering results not only on synthetic data sets but also on a real-world image data set; the DUMDPBP algorithm is suitable for various types of data sets and has very good application prospects.
The invention combines mutual adjacency with a weighted Euclidean distance to propose a new density calculation, avoiding the artificial setting of a truncation distance; it then proposes a strategy of self-recommending local center points before assigning the remaining points, so that local center points of relatively low-density clusters are also selected and such clusters are not ignored; finally, homologous clusters are sought while deleting boundary points layer by layer and are merged into clusters, completing the clustering task. Experiments show that the proposed algorithm obtains good results on data sets of different distribution types and has a wide field of application.

Claims (10)

1. A clustering method of image data, characterized by comprising the steps of:
1) acquiring image data to be clustered, performing dimension reduction processing on the image data, taking each image as a data point, and determining the density of each data point;
2) for each data point, judging whether the density of the data point is greater than that of the data points in the neighborhood, if so, recommending the data point as a local central point, taking the local central point as a center, distributing the rest data points to the local central point, and generating a micro-cluster;
3) determining the number of boundary points according to a first set proportion, selecting a corresponding number of points with larger boundary degree as the boundary points, and determining whether an adjacent cluster relation is formed between two micro clusters under the condition of not considering the boundary points;
4) determining boundary points according to a second set proportion, judging whether the determined boundary points contain local central points, if so, selecting the micro-cluster with the highest combination degree from the adjacent clusters of the micro-cluster where the local central points are located for combination, and deleting the boundary points;
and continuously determining boundary points from the rest points again according to a second set proportion to obtain a next layer of boundary points, judging whether the next layer of boundary points contain local central points, if so, selecting the micro-cluster with the highest combination degree from the adjacent clusters of the micro-cluster with the local central points to combine, deleting the layer of boundary points until the boundary points with the set times are deleted, and then combining the micro-cluster with the adjacent clusters if data points exist in the adjacent clusters.
2. The method for clustering image data according to claim 1, wherein the density value of each point in step 1) is determined based on the mutual adjacency of the point and the average of the distances between the point and each point in its neighborhood.
3. The method for clustering image data according to claim 2, wherein the density value is calculated by the formula:
    ρ(x_p) = MND(x_p) / ( (1/|NG(x_p)|) · Σ_{x_q ∈ NG(x_p)} d_w(x_p, x_q) )

    MND(x_p) = Σ_{x_q ∈ NG(x_p)} 1 / (s_pq + s_qp)

where ρ(x_p) represents the density of data point x_p, MND(x_p) represents the mutual adjacency of x_p, NG(x_p) represents the neighborhood of x_p, |NG(x_p)| represents the number of data points in the neighborhood of x_p, and d_w(x_p, x_q) represents the weighted Euclidean distance from data point x_p to data point x_q; among the k neighbors of x_p, if x_q is the l-th neighbor of x_p, then s_pq = l, and the larger its value, the farther x_q is from x_p; if x_p is the l-th neighbor of x_q, then s_qp = l.
4. The method for clustering image data according to claim 2, wherein the distance between two data points is a weighted Euclidean distance, with the Pearson correlation coefficient between the two data points used as the weight.
5. The method for clustering image data according to any one of claims 1 to 4, wherein the boundary degree of a data point is calculated as:

    bp(x_p) = (1/|kNN(x_p)|) · Σ_{x_q ∈ kNN(x_p)} d_w(x_p, x_q)

    d_w(x_p, x_q) = Pear(x_p, x_q) · sqrt( Σ_{j=1}^m (x_pj - x_qj)² )

where bp(x_p) represents the boundary degree of data point x_p, kNN(x_p) represents the set of k nearest neighbors of x_p, |kNN(x_p)| represents the number of those neighbors, Pear(x_p, x_q) is the Pearson correlation coefficient between data points x_p and x_q, and m is the dimensionality of the data set.
6. The method for clustering image data according to any one of claims 1 to 4, wherein the neighboring-cluster relationship is determined by:

    R(c_i, c_j) holds if and only if ( ⋃_{x_p ∈ c_i} NG(x_p) ) ∩ c_j ≠ ∅

where NG(x_p) represents the neighborhood of data point x_p, c_i represents the i-th micro-cluster formed, and R(c_i, c_j) represents that micro-cluster c_i and micro-cluster c_j form a neighboring-cluster relationship.
7. The method for clustering image data according to claim 1, further comprising performing center point adjustment on the obtained micro-cluster, changing a local center point of the micro-cluster to a mean point of all data points in the micro-cluster, and if the mean point does not exist, changing to a data point in the micro-cluster closest to the mean point.
8. The method for clustering image data according to claim 1, wherein the degree of combination is calculated as:

    Com(c_i, c_j) = ( |c_jt| / |c_j| ) / d_w(lc_i, lc_j)

where |c_j| represents the number of data points in micro-cluster c_j in the initial state, |c_jt| represents the number of data points in c_j after the t-th layer of boundary points has been deleted, and d_w(lc_i, lc_j) represents the weighted Euclidean distance between the two local center points; the larger the value of |c_jt|/|c_j|, the closer c_j is to the cluster core region, and the higher the degree of combination of c_j and c_i.
9. The method for clustering image data according to claim 1, further comprising a process of adjusting the clusters after the merging of the micro clusters:
and after merging the homologous clusters, judging whether the number of data points in a resulting cluster is less than or equal to 0-5% of the total number of data points; if so, treating the cluster as an abnormal cluster and assigning each of its data points to the non-abnormal cluster containing the data point closest to it.
10. An apparatus for clustering image data, comprising a memory and a processor, and a computer program stored in the memory and running on the processor, wherein the processor is coupled to the memory, and wherein the processor executes the computer program to implement the method for clustering image data according to any one of claims 1 to 9.
CN202010260470.7A 2020-04-03 2020-04-03 Image data clustering method and device Active CN111626321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010260470.7A CN111626321B (en) 2020-04-03 2020-04-03 Image data clustering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010260470.7A CN111626321B (en) 2020-04-03 2020-04-03 Image data clustering method and device

Publications (2)

Publication Number Publication Date
CN111626321A (en) 2020-09-04
CN111626321B (en) 2023-06-06

Family

ID=72271810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010260470.7A Active CN111626321B (en) 2020-04-03 2020-04-03 Image data clustering method and device

Country Status (1)

Country Link
CN (1) CN111626321B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000028441A2 (en) * 1998-11-11 2000-05-18 Microsoft Corporation A density-based indexing method for efficient execution of high-dimensional nearest-neighbor queries on large databases
CN105139035A (en) * 2015-08-31 2015-12-09 浙江工业大学 Mixed attribute data flow clustering method for automatically determining clustering center based on density
CN109409400A (en) * 2018-08-28 2019-03-01 西安电子科技大学 Merge density peaks clustering method, image segmentation system based on k nearest neighbor and multiclass
CN110929758A (en) * 2019-10-24 2020-03-27 河海大学 Complex data-oriented clustering algorithm for rapidly searching and finding density peak

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112907602A (en) * 2021-01-28 2021-06-04 中北大学 Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm
CN113344128A (en) * 2021-06-30 2021-09-03 福建师范大学 Micro-cluster-based industrial Internet of things adaptive stream clustering method and device
CN113344128B (en) * 2021-06-30 2023-06-23 福建师范大学 Industrial Internet of things self-adaptive stream clustering method and device based on micro clusters
CN115858002A (en) * 2023-02-06 2023-03-28 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN115858002B (en) * 2023-02-06 2023-04-25 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN117152543A (en) * 2023-10-30 2023-12-01 山东浪潮科学研究院有限公司 Image classification method, device, equipment and storage medium
CN117152543B (en) * 2023-10-30 2024-06-07 山东浪潮科学研究院有限公司 Image classification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111626321B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111626321A (en) Image data clustering method and device
CN110309860A (en) The method classified based on grade malignancy of the convolutional neural networks to Lung neoplasm
CN106845536B (en) Parallel clustering method based on image scaling
CN106126918B (en) A kind of geographical space abnormal aggregation domain scanning statistical method based on interaction force
KR101866522B1 (en) Object clustering method for image segmentation
CN109492796A (en) A kind of Urban Spatial Morphology automatic Mesh Partition Method and system
CN114386466B (en) Parallel hybrid clustering method for candidate signal mining in pulsar search
CN109636809B (en) Image segmentation level selection method based on scale perception
CN102509119B (en) Method for processing image scene hierarchy and object occlusion based on classifier
CN107833224A (en) A kind of image partition method based on multi-level region synthesis
CN111709487A (en) Underwater multi-source acoustic image substrate classification method and system based on decision-level fusion
CN102902976A (en) Image scene classification method based on target and space relationship characteristics
Bruzzese et al. DESPOTA: DEndrogram slicing through a permutation test approach
CN113988198B (en) Multi-scale city function classification method based on landmark constraint
CN114638301A (en) Density peak value clustering algorithm based on density similarity
CN113420738B (en) Self-adaptive network remote sensing image classification method, computer equipment and storage medium
Li et al. A new density peak clustering algorithm based on cluster fusion strategy
CN110097636B (en) Site selection planning method based on visual field analysis
CN117056761A (en) Customer subdivision method based on X-DBSCAN algorithm
CN103336781B (en) A kind of medical image clustering method
CN113205124B (en) Clustering method, system and storage medium based on density peak value under high-dimensional real scene
CN116578893A (en) Clustering integration system and method for self-adaptive density peak value
CN115510959A (en) Density peak value clustering method based on natural nearest neighbor and multi-cluster combination
Hu et al. A cluster validity index for fuzzy c-means clustering
Saha et al. Unsupervised pixel classification in satellite imagery using a new multiobjective symmetry based clustering approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant