CN101777126A - Clustering method for multidimensional characteristic vectors - Google Patents

Publication number
CN101777126A
Authority
CN
China
Prior art keywords
seeds
feature vector
new
data
class label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010114138
Other languages
Chinese (zh)
Inventor
黄锐
桑农
唐奇伶
高俊
高常鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN 201010114138 priority Critical patent/CN101777126A/en
Publication of CN101777126A publication Critical patent/CN101777126A/en
Pending legal-status Critical Current

Abstract

The invention discloses a clustering method for multidimensional feature vectors. Based on observations of how data are distributed in feature space, the method starts from the data in high-density regions, which are the easiest to cluster and yield spatially coherent results, and clusters incrementally and iteratively. Each iteration selects the data of higher density as a seed set and organizes the data through a seed-growing process, so that every clustering step in the iteration is carried out on the data of relatively highest density. Results show that the method can produce good clusterings that classical clustering algorithms cannot obtain.

Description

A clustering method for multidimensional feature vectors
Technical field
The invention belongs to the field of pattern recognition and specifically relates to a method for clustering multidimensional feature vectors.
Background art
Clustering sample data in a feature space is an important information-processing tool in fields such as pattern recognition, computer vision, and data mining. Clustering not only reduces the amount of data to be processed; the clustering result also reveals similarity patterns among the data. A robust clustering method should partition the data points in the feature space into disjoint subsets (each subset regarded as one class) such that the distances between data points in the same subset (class) are as small as possible and the distances between data points in different subsets (classes) are as large as possible. The present invention calls this property spatial coherence (spatially coherent).
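The spatial-coherence criterion above can be made concrete with a small numeric check. The following is a minimal sketch, not part of the patent: the function name `mean_intra_inter_distance` and the toy data are illustrative assumptions. It compares the mean distance between points that share a label with the mean distance between points that do not.

```python
import math
from itertools import combinations

def _mean(values):
    # Average of a list; NaN when there are no pairs of that kind.
    return sum(values) / len(values) if values else float("nan")

def mean_intra_inter_distance(points, labels):
    """Return (mean within-class distance, mean between-class distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance
        (intra if lp == lq else inter).append(d)
    return _mean(intra), _mean(inter)

# Two tight pairs of points that lie far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_intra_inter_distance(points, [0, 0, 1, 1])  # labels follow the pairs
bad = mean_intra_inter_distance(points, [0, 1, 0, 1])   # labels cut across the pairs
```

In this toy case the coherent labeling has a small within-class distance and a large between-class distance, while the second labeling reverses that relation.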
Classical clustering methods currently include the K-means clustering algorithm (reference: J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proc. Fifth Berkeley Symp. Math., Statistics, and Probability, 1967: 281-297), the Normalized Cut clustering algorithm (reference: J. Shi and J. Malik, "Normalized cuts and image segmentation", IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22(8): 888-905), and the mean shift clustering algorithm (reference: D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24(5): 603-619). Given the feature vectors to be clustered, the K-means and Normalized Cut algorithms produce a clustering result once the desired number of classes is specified. The mean shift algorithm instead requires a feature bandwidth parameter; using this parameter, it iteratively seeks local high-density regions of the feature space through nonparametric density estimation and assigns the feature vectors that belong to the same local high-density region to the same class. If the data to be clustered follow a scattered blob-like distribution in the feature space, that is, every data point lies in one of several high-density regions (blobs) and data points between these regions are very sparse (the density is discontinuous between high-density regions), the classical algorithms above can effectively output a spatially coherent clustering result (in fact, data with a scattered blob-like distribution are already spatially coherent). In practice, however, the data to be clustered often do not follow a scattered blob-like distribution in the feature space. In computer vision, for example, the feature vectors extracted from images often follow complicated manifold distributions. Clustering such data directly with the classical methods above often fails to yield a spatially coherent result. The most important reason is that, in real data with complex distributions, there is often no clear boundary between high-density regions: data points of relatively lower density still lie between them, and these low-density points are not sparse enough. In pattern classification and machine learning there is now a consensus in clustering research that the uncertainty of a clustering result tends to arise at data points in the low-density regions of the feature space.
In addition, when feature vectors are hard to cluster well in the original feature space, transforming them into a new feature space and clustering them there is a good idea. The prior art includes a semi-supervised discriminant analysis algorithm (reference: D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis", in Proc. IEEE Int. Conf. Computer Vision, Rio de Janeiro, Brazil, Jun. 2007). After the original feature space is transformed with this algorithm, the feature vectors often have good separability in the new feature space, which greatly benefits clustering them.
Summary of the invention
The object of the present invention is to provide a clustering method for multidimensional feature vectors whose clustering results are more spatially coherent, that is, more robust, and therefore describe the class membership of the feature vectors more objectively.
The multidimensional feature vector clustering method provided by the invention comprises the following steps:
(1) Denote the m feature vectors to be clustered as the feature vector set $X = \{x_1, x_2, \ldots, x_m\}$, where $x_i$ is a feature vector, $i = 1, \ldots, m$;
(2) Build a k-nearest-neighbor graph $G_k$ on the feature vector set $X$, using the Euclidean distance $\|x_i - x_j\|_2$ to measure how far apart any two feature vectors $x_i$ and $x_j$ in $X$ are;
(3) Obtain the adjacency matrix $A$ of the k-nearest-neighbor graph $G_k$ from step (2), where each element $A_{ij}$ of $A$ is computed by formula (1):
[Formula (1): equation image GSA00000045056400031 not reproduced]
where $\mathrm{aff}(ij)$ is the affinity between feature vector $x_i$ and feature vector $x_j$, $N_k(x_j)$ denotes the k nearest neighbors of $x_j$, and $N_k(x_i)$ denotes the k nearest neighbors of $x_i$; $\mathrm{aff}(ij)$ is computed by formula (2):
$$\mathrm{aff}(ij) = \exp\left\{ -\frac{\|x_i - x_j\|_2^2}{\sigma^2} \right\} \qquad (2)$$
where $\sigma$ is a constant;
(4) Compute the density $\mathrm{den}(x_i)$ of each feature vector in $X = \{x_1, x_2, \ldots, x_m\}$ as follows:
$$\mathrm{den}(x_i) = \sum_{j=1}^{m} A_{ij} \qquad (3)$$
Take the 96th percentile of all feature vector densities $\{\mathrm{den}(x_i)\}_{i=1,\ldots,m}$ and denote this density value as the threshold $T_{96}$;
(5) Obtain the seed set $X_{seeds} = \{x_i \mid \mathrm{den}(x_i) > T_{96},\ x_i \in X\}$;
(6) Cluster the current seed set $X_{seeds}$ with the mean shift algorithm to obtain the class label set $L_{seeds}$ of the current seed set, where each element of $L_{seeds}$ is the class label of a feature vector in $X_{seeds}$; labels are usually distinguished by natural numbers;
(7) Perform incremental iterative clustering on the current seed set $X_{seeds}$:
First, select from the k-nearest-neighbor graph $G_k$ of step (2) all k-nearest-neighbor data $\Delta X$ of the current seed set $X_{seeds}$, defined as $\Delta X = \{x_i \mid x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i),\ x_j \in X_{seeds}\}$, and merge the current seed set $X_{seeds}$ and $\Delta X$ into a new seed set, denoted
$$X_{seeds}^{new} = X_{seeds} \cup \Delta X$$
Second, treat the current seed set $X_{seeds}$ within the new seed set $X_{seeds}^{new}$ as data with class labels and $\Delta X$ as data without class labels, apply the semi-supervised discriminant analysis method to $X_{seeds}^{new}$ to obtain a basis $U_{opt}$ of an optimal projection space, and project $X_{seeds}^{new}$ into the space spanned by $U_{opt}$; the projected data are:
$$X_{seeds}^{SDA} = U_{opt}^{T} X_{seeds}^{new}$$
where $U_{opt}^{T}$ denotes the transpose of the matrix $U_{opt}$;
Then, cluster $X_{seeds}^{SDA}$ with the mean shift algorithm, assign the resulting class labels to the corresponding data in $X_{seeds}^{new}$, denote the class label set of $X_{seeds}^{new}$ as $L_{seeds}^{new}$, and update $X_{seeds}$ and $L_{seeds}$ by setting
$$X_{seeds} = X_{seeds}^{new}$$
$$L_{seeds} = L_{seeds}^{new}$$
Repeat the above process until $\Delta X = \varnothing$, at which point the loop stops and yields the updated current seed set $X_{seeds}$ and the clustering result $L_{seeds}$, where $\varnothing$ is the empty set.
(8) Obtain the class label set $L$ of $X$ to complete the clustering:
If $X - X_{seeds} \neq \varnothing$, the data in $\{X - X_{seeds}\}$ are jointly given one new class label $l_{rest}$, and the class label set is
[equation image GSA00000045056400054 not reproduced]
with $n \in \{1, \ldots, m\}$, where
[equation image GSA00000045056400055 not reproduced]
Otherwise, the class label set is $L = L_{seeds}$.
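Steps (2) to (5) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation. Formula (1) is an image that is not reproduced in this text, so the code assumes $A_{ij} = \mathrm{aff}(ij)$ when $x_i \in N_k(x_j)$ or $x_j \in N_k(x_i)$ and $A_{ij} = 0$ otherwise, mirroring the condition used for $\Delta X$ in step (7). Formula (2) is read as a squared Euclidean distance divided by $\sigma^2$. The percentile rule (linear interpolation between order statistics) is also an assumption, since the text does not define it. All function names are hypothetical.

```python
import math

def knn_indices(X, k):
    """For each point, the index set of its k nearest neighbors (Euclidean)."""
    nbrs = []
    for i, xi in enumerate(X):
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(xi, X[j]))
        nbrs.append(set(order[:k]))
    return nbrs

def adjacency(X, nbrs, sigma):
    """Assumed form of formula (1): aff(ij) on k-NN edges in either direction, else 0."""
    m = len(X)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j and (i in nbrs[j] or j in nbrs[i]):
                # Formula (2) as read here: exp(-||xi - xj||^2 / sigma^2)
                A[i][j] = math.exp(-math.dist(X[i], X[j]) ** 2 / sigma ** 2)
    return A

def percentile(values, q):
    """q-th percentile with linear interpolation (an assumed convention)."""
    v = sorted(values)
    pos = (len(v) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)

def select_seeds(X, k=7, sigma=3.0, q=96):
    """Steps (2)-(5): k-NN graph, adjacency, density (formula (3)), threshold, seeds."""
    nbrs = knn_indices(X, k)
    A = adjacency(X, nbrs, sigma)
    den = [sum(row) for row in A]          # formula (3): row sums of A
    threshold = percentile(den, q)         # T_96 when q = 96
    seeds = [i for i, d in enumerate(den) if d > threshold]
    return seeds, den, nbrs

# Toy data: four tightly packed points plus four isolated ones.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50), (60, 80), (90, 20), (20, 90)]
seeds, den, nbrs = select_seeds(X, k=2, sigma=3.0, q=50)
```

The toy call uses a 50th-percentile threshold only because eight points are too few for a 96th percentile to be meaningful; the defaults follow the values stated in the embodiment.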
Description of drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 shows 1071 three-dimensional feature vectors, where markers of 3 colors represent 3 classes; the clustering result of an effective clustering algorithm should agree with the distribution of the three classes in Fig. 2.
Fig. 3 is the clustering result of the K-means algorithm.
Fig. 4 is the clustering result of the Normalized Cut algorithm.
Fig. 5 is the clustering result of the mean shift algorithm.
Fig. 6 is the clustering result of the clustering algorithm of the invention.
Embodiment
The invention is explained in further detail below with reference to the drawings and a specific embodiment.
In this embodiment, as shown in Fig. 1, the detailed procedure is:
(1) Denote the m feature vectors to be clustered as the feature vector set $X = \{x_1, x_2, \ldots, x_m\}$, where $x_i$ is a feature vector, $i = 1, \ldots, m$. The number m of feature vectors is usually on the order of $10^2$ to $10^3$. Fig. 2 shows the spatial distribution for m = 1071, with each $x_i$ a 3-dimensional feature vector.
(2) Build a k-nearest-neighbor graph $G_k$ on the feature vector set $X$. The value of k is generally 5 to 7; in this embodiment k = 7. When building the graph $G_k$, the Euclidean distance $\|x_i - x_j\|_2$ is used to measure how far apart any two feature vectors $x_i$ and $x_j$ in $X$ are.
(3) Obtain the adjacency matrix $A$ of the k-nearest-neighbor graph $G_k$ from step (2), where each element $A_{ij}$ of $A$ is computed by formula (1):
[Formula (1): equation image GSA00000045056400061 not reproduced]
where $\mathrm{aff}(ij)$ is the affinity between feature vector $x_i$ and feature vector $x_j$, $N_k(x_j)$ denotes the k nearest neighbors of $x_j$, and $N_k(x_i)$ denotes the k nearest neighbors of $x_i$. $\mathrm{aff}(ij)$ is computed by formula (2):
$$\mathrm{aff}(ij) = \exp\left\{ -\frac{\|x_i - x_j\|_2^2}{\sigma^2} \right\} \qquad (2)$$
$\sigma$ is a constant whose value depends on the type of feature vector; it is usually two orders of magnitude smaller than the value range of the feature vectors. In this embodiment the feature vectors $x_i$ take values in 0-255, and we choose $\sigma = 3$.
(4) Compute the density $\mathrm{den}(x_i)$ of each feature vector in $X = \{x_1, x_2, \ldots, x_m\}$ as follows:
$$\mathrm{den}(x_i) = \sum_{j=1}^{m} A_{ij} \qquad (3)$$
Take the 96th percentile of all feature vector densities $\{\mathrm{den}(x_i)\}_{i=1,\ldots,m}$ and denote this density value as the threshold $T_{96}$.
(5) Choose the seed set $X_{seeds} = \{x_i \mid \mathrm{den}(x_i) > T_{96},\ x_i \in X\}$.
(6) Cluster the current seed set $X_{seeds}$ with the mean shift algorithm (reference: D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24(5): 603-619) to obtain the class label set $L_{seeds}$ of the current seed set, where each element of $L_{seeds}$ is the class label of a feature vector in $X_{seeds}$; labels are usually distinguished by natural numbers.
(7) Incremental iterative clustering:
First, select from the k-nearest-neighbor graph $G_k$ of step (2) all k-nearest-neighbor data of the current seed set $X_{seeds}$, defined as $\Delta X = \{x_i \mid x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i),\ x_j \in X_{seeds}\}$, and merge the current seed set $X_{seeds}$, which already has class labels, and $\Delta X$, which has no class labels, into a new seed set, denoted
$$X_{seeds}^{new} = X_{seeds} \cup \Delta X$$
Second, apply the semi-supervised discriminant analysis method (reference: D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis", in Proc. IEEE Int. Conf. Computer Vision, Rio de Janeiro, Brazil, Jun. 2007) to $X_{seeds}^{new}$ to obtain a basis $U_{opt}$ of an optimal projection space, and project $X_{seeds}^{new}$ into the space spanned by $U_{opt}$; the projected data are:
$$X_{seeds}^{SDA} = U_{opt}^{T} X_{seeds}^{new}$$
where $U_{opt}^{T}$ denotes the transpose of the matrix $U_{opt}$.
Then, cluster $X_{seeds}^{SDA}$ with the mean shift algorithm and assign the resulting class labels to the corresponding data in $X_{seeds}^{new}$. Denote the class label set of $X_{seeds}^{new}$ as $L_{seeds}^{new}$.
Update $X_{seeds}$ and $L_{seeds}$ by setting $X_{seeds} = X_{seeds}^{new}$ and $L_{seeds} = L_{seeds}^{new}$.
Repeat the above process until $\Delta X = \varnothing$, at which point the loop stops and yields the final seed set $X_{seeds}$ and the clustering result $L_{seeds}$.
(8) Obtain the class label set $L$ of $X$ to complete the clustering: if $X - X_{seeds} \neq \varnothing$, the data in $\{X - X_{seeds}\}$ are jointly given one new class label $l_{rest}$, and the class label set is
[equation image GSA00000045056400077 not reproduced]
with $n \in \{1, \ldots, m\}$, where
[equation image GSA00000045056400078 not reproduced]
Otherwise, the class label set is $L = L_{seeds}$.
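The control flow of steps (7) and (8) can be sketched as follows. This is a simplified illustration, not the patented procedure: the relabeling step by semi-supervised discriminant analysis followed by mean shift is replaced by a plain stand-in that copies the label of the nearest current seed, and existing seed labels are never revised, whereas the method described above reclusters the whole new seed set on each pass. The sketch also assumes that $\Delta X$ excludes points already in the seed set, which is what lets the loop end with $\Delta X = \varnothing$. The names `grow_seeds`, `finish`, and `rest_label` are hypothetical.

```python
import math

def grow_seeds(X, nbrs, seeds, seed_labels):
    """Grow a labeled seed set over a k-NN graph until no new neighbors remain.

    nbrs[i] is the set of k-nearest-neighbor indices of point i.
    Returns a dict mapping each reached point index to its label.
    """
    labels = dict(zip(seeds, seed_labels))
    while True:
        # Delta X: unlabeled points that are a k-NN of a seed,
        # or that have a seed among their own k-NN.
        delta = {i for i in range(len(X)) if i not in labels
                 and any(i in nbrs[j] or j in nbrs[i] for j in labels)}
        if not delta:  # Delta X is empty: stop growing
            break
        # Stand-in for SDA + mean shift: copy the nearest current seed's label.
        current = list(labels)
        new = {i: labels[min(current, key=lambda j: math.dist(X[i], X[j]))]
               for i in delta}
        labels.update(new)
    return labels

def finish(X, labels, rest_label=-1):
    """Step (8): points never reached by the growth share one extra label."""
    return [labels.get(i, rest_label) for i in range(len(X))]

# Toy 1-D data with a hand-written 1-nearest-neighbor graph.
X = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (100.0,), (101.0,)]
nbrs = [{1}, {0}, {1}, {4}, {3}, {4}, {7}, {6}]
result = finish(X, grow_seeds(X, nbrs, seeds=[1, 4], seed_labels=[0, 1]))
```

In the toy run the two seeds absorb their graph neighbors in one pass, and the last two points, which form a component with no seed, receive the shared rest label.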
Fig. 6 shows the clustering result obtained by the algorithm of the invention, and Figs. 3-5 show the results of the three classical clustering algorithms. The result of the invention is more consistent with the original class distribution in Fig. 2, which demonstrates the effectiveness of the invention.
In the invention, the feature bandwidth parameter $h_r$ of the mean shift algorithm is chosen according to the numerical range of the feature vectors at hand; it is usually one order of magnitude smaller than the value range of the feature vectors. In the above embodiment, the chosen feature bandwidth parameter is $h_r = 10.5$.
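To show what the bandwidth controls, here is a minimal flat-kernel mean shift in plain Python. It is a simplified sketch and not the Comaniciu and Meer algorithm cited in the text: it uses a uniform window of radius `h`, and the mode-merging rule (modes closer than `h` share a label) is an assumption made for brevity. The function name and parameters are hypothetical.

```python
import math

def mean_shift(X, h, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift with bandwidth h.

    Each point is moved to the mean of the data inside a radius-h window
    until it stops moving; converged modes closer than h share a label.
    """
    modes = []
    for x in X:
        y = tuple(float(c) for c in x)
        for _ in range(max_iter):
            window = [p for p in X if math.dist(p, y) <= h]
            new_y = tuple(sum(c) / len(window) for c in zip(*window))
            shift = math.dist(new_y, y)
            y = new_y
            if shift < tol:
                break
        modes.append(y)
    centers, labels = [], []
    for y in modes:
        for idx, c in enumerate(centers):
            if math.dist(y, c) < h:   # same mode as an existing center
                labels.append(idx)
                break
        else:
            centers.append(y)         # a new mode
            labels.append(len(centers) - 1)
    return labels, centers

# Values in the 0-255 range, clustered with the embodiment's bandwidth.
X = [(10, 10, 10), (12, 11, 9), (11, 12, 10), (200, 200, 200), (202, 199, 201)]
labels, centers = mean_shift(X, h=10.5)
```

A larger `h` merges nearby groups into one class and a smaller `h` splits them, which is why the text ties the choice of $h_r$ to the value range of the feature vectors.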
According to an exemplary embodiment of the invention, a computer system used to implement the invention may include, in particular, a central processing unit (CPU), memory, and an input/output (I/O) interface. The computer system is usually connected through the I/O interface to a display and to various input devices such as a mouse and keyboard. Support circuits may include circuits such as cache memory, a power supply, clock circuits, and a communication bus. The memory may include random access memory (RAM), read-only memory (ROM), a disk drive, a tape drive, or a combination of these. The computer platform also includes an operating system and microinstruction code. The various processes and functions described here may be part of the microinstruction code or part of an application program (or a combination of the two) executed via the operating system. In addition, various other peripheral devices, such as additional data storage devices and printing devices, may be connected to the computer platform.
It should also be understood that, because some of the system components and method steps depicted in the drawings may be implemented in software, the actual connections between system components (or process steps) may differ depending on how the invention is programmed. Based on the principles of the invention set out here, a person of ordinary skill in the relevant art can contemplate these and similar implementations or configurations of the invention.

Claims (1)

1. A clustering method for multidimensional feature vectors, comprising the following steps:
(1) Denote the m feature vectors to be clustered as the feature vector set $X = \{x_1, x_2, \ldots, x_m\}$, where $x_i$ is a feature vector, $i = 1, \ldots, m$;
(2) Build a k-nearest-neighbor graph $G_k$ on the feature vector set $X$, using the Euclidean distance $\|x_i - x_j\|_2$ to measure how far apart any two feature vectors $x_i$ and $x_j$ in $X$ are;
(3) Obtain the adjacency matrix $A$ of the k-nearest-neighbor graph $G_k$ from step (2), where each element $A_{ij}$ of $A$ is computed by formula (1):
[Formula (1): equation image FSA00000045056300011 not reproduced]
where $\mathrm{aff}(ij)$ is the affinity between feature vector $x_i$ and feature vector $x_j$, $N_k(x_j)$ denotes the k nearest neighbors of $x_j$, and $N_k(x_i)$ denotes the k nearest neighbors of $x_i$; $\mathrm{aff}(ij)$ is computed by formula (2):
$$\mathrm{aff}(ij) = \exp\left\{ -\frac{\|x_i - x_j\|_2^2}{\sigma^2} \right\} \qquad (2)$$
where $\sigma$ is a constant;
(4) Compute the density $\mathrm{den}(x_i)$ of each feature vector in $X = \{x_1, x_2, \ldots, x_m\}$ as follows:
$$\mathrm{den}(x_i) = \sum_{j=1}^{m} A_{ij} \qquad (3)$$
Take the 96th percentile of all feature vector densities $\{\mathrm{den}(x_i)\}_{i=1,\ldots,m}$ and denote this density value as the threshold $T_{96}$;
(5) Obtain the seed set $X_{seeds} = \{x_i \mid \mathrm{den}(x_i) > T_{96},\ x_i \in X\}$;
(6) Cluster the current seed set $X_{seeds}$ with the mean shift algorithm to obtain the class label set $L_{seeds}$ of the current seed set, where each element of $L_{seeds}$ is the class label of a feature vector in $X_{seeds}$; labels are usually distinguished by natural numbers;
(7) Perform incremental iterative clustering on the current seed set $X_{seeds}$:
First, select from the k-nearest-neighbor graph $G_k$ of step (2) all k-nearest-neighbor data $\Delta X$ of the current seed set $X_{seeds}$, defined as $\Delta X = \{x_i \mid x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i),\ x_j \in X_{seeds}\}$, and merge the current seed set $X_{seeds}$ and $\Delta X$ into a new seed set, denoted
$$X_{seeds}^{new} = X_{seeds} \cup \Delta X$$
Second, treat the current seed set $X_{seeds}$ within the new seed set $X_{seeds}^{new}$ as data with class labels and $\Delta X$ as data without class labels, apply the semi-supervised discriminant analysis method to $X_{seeds}^{new}$ to obtain a basis $U_{opt}$ of an optimal projection space, and project $X_{seeds}^{new}$ into the space spanned by $U_{opt}$; the projected data are $X_{seeds}^{SDA} = U_{opt}^{T} X_{seeds}^{new}$, where $U_{opt}^{T}$ denotes the transpose of the matrix $U_{opt}$;
Then, cluster $X_{seeds}^{SDA}$ with the mean shift algorithm, assign the resulting class labels to the corresponding data in $X_{seeds}^{new}$, denote the class label set of $X_{seeds}^{new}$ as $L_{seeds}^{new}$, and then update $X_{seeds}$ and $L_{seeds}$ by setting
$$X_{seeds} = X_{seeds}^{new}$$
$$L_{seeds} = L_{seeds}^{new}$$
Repeat the above process until
$$\Delta X = \varnothing$$
at which point the loop stops and yields the updated current seed set $X_{seeds}$ and the clustering result $L_{seeds}$, where $\varnothing$ is the empty set.
(8) Obtain the class label set $L$ of $X$ to complete the clustering:
If $X - X_{seeds} \neq \varnothing$, the data in $\{X - X_{seeds}\}$ are jointly given one new class label $l_{rest}$, and the class label set is
[equation image FSA00000045056300028 not reproduced]
with $n \in \{1, \ldots, m\}$, where
[equation image FSA00000045056300029 not reproduced]
Otherwise, the class label set is $L = L_{seeds}$.
CN 201010114138 2010-02-10 2010-02-10 Clustering method for multidimensional characteristic vectors Pending CN101777126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010114138 CN101777126A (en) 2010-02-10 2010-02-10 Clustering method for multidimensional characteristic vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010114138 CN101777126A (en) 2010-02-10 2010-02-10 Clustering method for multidimensional characteristic vectors

Publications (1)

Publication Number Publication Date
CN101777126A true CN101777126A (en) 2010-07-14

Family

ID=42513584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010114138 Pending CN101777126A (en) 2010-02-10 2010-02-10 Clustering method for multidimensional characteristic vectors

Country Status (1)

Country Link
CN (1) CN101777126A (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
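IEEE 20081231 Rui Huang et al. Approximation of Salient Contours in cluttered scenes Full text 1 , 2 *
IEEE 20091231 Rui Huang et al. Segmentation via Incremental Transductive Learning Sections II-III 1 , 2 *
《计算机应用》 (Journal of Computer Applications) 20050331 Yan Chengxin et al. Image histogram clustering segmentation based on graph partitioning Full text 1 Vol. 25, No. 3 2 *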

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543536A (en) * 2018-10-23 2019-03-29 北京市商汤科技开发有限公司 Image identification method and device, electronic equipment and storage medium
CN109543536B (en) * 2018-10-23 2020-11-10 北京市商汤科技开发有限公司 Image identification method and device, electronic equipment and storage medium
CN111563467A (en) * 2020-05-13 2020-08-21 金陵科技学院 Solar panel cleaning system based on machine vision
CN111652148A (en) * 2020-06-04 2020-09-11 航天科工智慧产业发展有限公司 Face recognition method and device and electronic equipment
CN113850281A (en) * 2021-02-05 2021-12-28 天翼智慧家庭科技有限公司 Data processing method and device based on MEANSHIFT optimization
CN113850281B (en) * 2021-02-05 2024-03-12 天翼数字生活科技有限公司 MEANSHIFT optimization-based data processing method and device
CN116628536A (en) * 2023-07-26 2023-08-22 杭州易靓好车互联网科技有限公司 Online transaction data processing system of automobile
CN116628536B (en) * 2023-07-26 2023-10-31 杭州易靓好车互联网科技有限公司 Online transaction data processing system of automobile

Similar Documents

Publication Publication Date Title
İnkaya et al. Ant colony optimization based clustering methodology
Ghamisi et al. Multilevel image segmentation based on fractional-order Darwinian particle swarm optimization
Xiang et al. Superpixel generating algorithm based on pixel intensity and location similarity for SAR image classification
dos Santos et al. A relevance feedback method based on genetic programming for classification of remote sensing images
Kiang et al. An evaluation of self-organizing map networks as a robust alternative to factor analysis in data mining applications
Bue et al. Automated classification of landforms on Mars
CN104346620A (en) Inputted image pixel classification method and device, and image processing system
CN103985112B (en) Image segmentation method based on improved multi-objective particle swarm optimization and clustering
CN111723815B (en) Model training method, image processing device, computer system and medium
JP6863926B2 (en) Data analysis system and data analysis method
CN101777126A (en) Clustering method for multidimensional characteristic vectors
CN112669298A (en) Foundation cloud image cloud detection method based on model self-training
Ma et al. A new clustering algorithm based on a radar scanning strategy with applications to machine learning data
Seiler et al. A collection of deep learning-based feature-free approaches for characterizing single-objective continuous fitness landscapes
CN106022359A (en) Fuzzy entropy space clustering analysis method based on orderly information entropy
CN113436223B (en) Point cloud data segmentation method and device, computer equipment and storage medium
Dammak et al. Histogram of dense subgraphs for image representation
CN113553442A (en) Unsupervised event knowledge graph construction method and system
Park et al. Seed growing for interactive image segmentation using SVM classification with geodesic distance
Beilschmidt et al. A linear-time algorithm for the aggregation and visualization of big spatial point data
Guo et al. Building fuzzy areal geographical objects from point sets
Sigut et al. Automatic marker generation for watershed segmentation of natural images
CN113486879A (en) Image area suggestion frame detection method, device, equipment and storage medium
Singh et al. Adaptive multiscale feature extraction in a distributed system for semantic classification of airborne LiDAR point clouds
Brodić et al. Classification of the scripts in medieval documents from Balkan region by run-length texture analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100714