CN101777126A - Clustering method for multidimensional characteristic vectors - Google Patents
- Publication number: CN101777126A
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a clustering method for multidimensional feature vectors. Based on observation of how data are distributed in feature space, the method starts from data in high-density regions, which are the easiest to cluster and yield spatially consistent results, and provides a clustering method in an incremental-iterative mode. Each iteration selects data of higher density as a seed set and organizes the data through a seed-growing process, so that each clustering step in the iteration is carried out on the data of relatively highest density. Results show that the clustering method of the invention can produce good results that classical clustering algorithms cannot obtain.
Description
Technical field
The invention belongs to the field of pattern recognition and specifically relates to a method for clustering multidimensional feature vectors.
Background art
Clustering sample data in a feature space is an important information-processing tool in fields such as pattern recognition, computer vision, and data mining. Clustering not only reduces the volume of data that must be processed; the clustering result also reveals similarity patterns among the data. A robust clustering method should partition the data points in the feature space into disjoint subsets (each subset regarded as one class) such that the distances between data points in the same subset (class) are as small as possible and the distances between data points in different subsets (classes) are as large as possible. The present invention calls this property of a robust result spatial consistency (spatially coherent).
Classical clustering methods at present include the K-means clustering algorithm (reference: J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proc. Fifth Berkeley Symp. Math., Statistics, and Probability, 1967: 281-297), the Normalized Cut clustering algorithm (reference: J. Shi and J. Malik, "Normalized cuts and image segmentation", IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22(8): 888-905), and the mean shift clustering algorithm (reference: D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24(5): 603-619). Given the feature vectors to be clustered, K-means and Normalized Cut produce a clustering result once the desired number of classes is specified. Mean shift instead requires a feature bandwidth parameter; using this parameter, it applies nonparametric density estimation to seek local high-density regions in the feature space and assigns the feature vectors that belong to the same local high-density region to the same class. If the data to be clustered present a scattered blob-like distribution in feature space, that is, every data point lies in one of several high-density regions (blobs) and the data points between these regions are very sparse (the density is discontinuous between the high-density regions), the classical algorithms above can effectively output a spatially consistent clustering result (in fact, data with a scattered blob-like distribution are themselves spatially consistent). In practical applications, however, the data to be clustered often do not present a scattered blob-like distribution in feature space. In computer vision, for example, the feature vectors extracted from images often present complex manifold distributions. Clustering such data directly with the classical methods above often fails to yield a spatially consistent result. The most important reason is that in real data with complex distributions there is often no clear boundary between high-density regions: some data points of lower relative density still lie between them, and these low-density points are not sparse enough. In pattern classification and machine learning, researchers have reached a consensus that the uncertainty of a clustering result tends to occur at data points in the low-density regions of the feature space.
In addition, when feature vectors are difficult to cluster well in the original feature space, transforming them into a new feature space and clustering them there is a good idea. The prior art includes a semi-supervised discriminant analysis algorithm (reference: D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis", in Proc. IEEE Int. Conf. Computer Vision, Rio de Janeiro, Brazil, Jun. 2007). After the original feature space is transformed with this algorithm, the feature vectors often have good separability in the new feature space, which greatly benefits clustering them.
Summary of the invention
The object of the invention is to provide a clustering method for multidimensional feature vectors whose clustering results are more spatially consistent, i.e. more robust, and which therefore describes more objectively which class each feature vector belongs to.
The clustering method for multidimensional feature vectors provided by the invention comprises the following steps:
(1) Denote the m feature vectors to be clustered as the feature vector set $X=\{x_1, x_2, \ldots, x_m\}$, where $x_i$ is a feature vector, $i=1,\ldots,m$;
(2) Build a k-nearest-neighbor graph $G_k$ on the feature vector set X, using the Euclidean distance $\|x_i-x_j\|_2$ to measure how near or far any two feature vectors $x_i$ and $x_j$ in X are from each other;
(3) Obtain the adjacency matrix A of the k-nearest-neighbor graph $G_k$ from step (2), where each element $A_{ij}$ of A is computed by formula (1):
[formula (1): not reproduced in the source text]
where $\mathrm{aff}_{(ij)}$ is the degree of adjacency between feature vectors $x_i$ and $x_j$, $N_k(x_j)$ denotes the k nearest neighbors of $x_j$, and $N_k(x_i)$ denotes the k nearest neighbors of $x_i$; $\mathrm{aff}_{(ij)}$ is computed by formula (2):
[formula (2): not reproduced in the source text]
where σ is a constant;
(4) Compute the density $\mathrm{den}(x_i)$ of each feature vector in $X=\{x_1, x_2, \ldots, x_m\}$ by the following formula:
[density formula: not reproduced in the source text]
Take the 96th-percentile value of all feature vector densities $\{\mathrm{den}(x_i)\}_{i=1,\ldots,m}$ and denote it as the threshold $T_{96}$;
(5) Obtain the seed set $X_{seeds}$, where $X_{seeds}=\{x_i \mid \mathrm{den}(x_i)>T_{96},\ x_i\in X\}$;
(6) Cluster the current seed set $X_{seeds}$ with the mean shift algorithm to obtain the class label set $L_{seeds}$ of the current seed set, where each element of $L_{seeds}$ is the class label of a feature vector in $X_{seeds}$; the labels are usually distinguished by natural numbers;
(7) Perform incremental iterative clustering on the current seed set $X_{seeds}$:
First, from the k-nearest-neighbor graph $G_k$ of step (2), select all k-nearest-neighbor data ΔX of the current seed set $X_{seeds}$, defined as $\Delta X=\{x_i \mid x_i\in N_k(x_j)\ \text{or}\ x_j\in N_k(x_i)\}$, where $x_j\in X_{seeds}$; merge the current seed set $X_{seeds}$ and ΔX into a new seed set, denoted $X_{seeds}^{new}=X_{seeds}\cup\Delta X$;
Secondly, within the new seed set $X_{seeds}^{new}$, regard the current seed set $X_{seeds}$ as data with class labels and ΔX as data without class labels; apply the semi-supervised discriminant analysis method to $X_{seeds}^{new}$ to obtain a basis $U_{opt}$ of an optimal projection space, and project $X_{seeds}^{new}$ into the space spanned by $U_{opt}$; the projected data are $X_{seeds}^{SDA}=U_{opt}^{T}X_{seeds}^{new}$, where $U_{opt}^{T}$ denotes the transpose of the matrix $U_{opt}$;
Then, cluster $X_{seeds}^{SDA}$ with the mean shift algorithm and assign the resulting class labels to the corresponding data in $X_{seeds}^{new}$; denote the class label set of $X_{seeds}^{new}$ as $L_{seeds}^{new}$, and update $X_{seeds}$ and $L_{seeds}$, i.e. let $X_{seeds}=X_{seeds}^{new}$ and $L_{seeds}=L_{seeds}^{new}$;
Repeat the above process until
[stopping condition: formula not reproduced in the source text]
at which point the loop stops, yielding the updated current seed set $X_{seeds}$ and the clustering result $L_{seeds}$, where $\emptyset$ is the empty set.
(8) Obtain the class label set L of X to complete the clustering:
[formula: not reproduced in the source text]
Description of drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 shows 1071 three-dimensional feature vectors, where markers of 3 colors represent 3 classes; the result of an effective clustering algorithm should agree with the distribution of the three classes in Fig. 2.
Fig. 3 is the clustering result of the K-means algorithm.
Fig. 4 is the clustering result of the Normalized Cut algorithm.
Fig. 5 is the clustering result of the mean shift algorithm.
Fig. 6 is the clustering result of the clustering algorithm of the invention.
Embodiment
The invention is described in further detail below with reference to the drawings and a specific embodiment.
In this embodiment, as shown in Fig. 1, the detailed procedure is:
(1) Denote the m feature vectors to be clustered as the feature vector set $X=\{x_1, x_2, \ldots, x_m\}$, where $x_i$ is a feature vector, $i=1,\ldots,m$. The number m of feature vectors is usually on the order of $10^2$ to $10^3$. Fig. 2 shows the spatial distribution for m=1071, where each $x_i$ is a 3-dimensional feature vector.
(2) Build a k-nearest-neighbor graph $G_k$ on the feature vector set X. The value of k is generally 5 to 7; k=7 in this embodiment. When building $G_k$, the Euclidean distance $\|x_i-x_j\|_2$ is used to measure how near or far any two feature vectors $x_i$ and $x_j$ in X are from each other.
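As a rough illustration of step (2), the k-nearest-neighbor sets that define $G_k$ can be found by brute force. This is a minimal sketch rather than the patent's implementation: the function name `knn_sets`, the tie-breaking by index, and the toy points are assumptions made for the example.

```python
import math

def knn_sets(points, k):
    """Return, for each index i, the set of indices of the k nearest
    neighbours of points[i] under Euclidean distance (i itself excluded)."""
    neighbours = []
    for i, p in enumerate(points):
        # Sort all other indices by distance to point i; ties break by index.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbours.append(set(order[:k]))
    return neighbours

# Toy data (hypothetical): three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
nbrs = knn_sets(points, k=2)
```

The brute-force search costs O(m²) distance evaluations, which is acceptable for the m of $10^2$ to $10^3$ mentioned in step (1); a spatial index would be a natural substitute for larger sets.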
(3) Obtain the adjacency matrix A of the k-nearest-neighbor graph $G_k$ from step (2). Each element $A_{ij}$ of A is computed by formula (1):
[formula (1): not reproduced in the source text]
where $\mathrm{aff}_{(ij)}$ is the degree of adjacency between feature vectors $x_i$ and $x_j$, $N_k(x_j)$ denotes the k nearest neighbors of $x_j$, and $N_k(x_i)$ denotes the k nearest neighbors of $x_i$. $\mathrm{aff}_{(ij)}$ is computed by formula (2):
[formula (2): not reproduced in the source text]
σ is a constant whose value depends on the type of feature vector; it is usually two orders of magnitude smaller than the value range of the feature vectors. In this embodiment the components of the feature vectors $x_i$ range over 0-255, and σ=3 is chosen.
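The sketch below shows one way an adjacency matrix of the kind described in step (3) could be assembled. Because formulas (1) and (2) are not available in this text, two assumptions are made and should be checked against the original patent: an entry is nonzero when either vector is among the other's k nearest neighbors, and the degree of adjacency is a Gaussian kernel $\exp(-\|x_i-x_j\|^2/(2\sigma^2))$. The function name and the toy data are hypothetical.

```python
import math

def adjacency(points, neighbours, sigma):
    """Build a dense adjacency matrix from k-nearest-neighbour sets.

    ASSUMPTION: A[i][j] = exp(-||xi - xj||^2 / (2 sigma^2)) when i is a
    neighbour of j or j is a neighbour of i, and 0 otherwise.
    """
    m = len(points)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j and (i in neighbours[j] or j in neighbours[i]):
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A

# Toy 1-D data with hand-written 1-nearest-neighbour sets (hypothetical).
points = [(0.0,), (1.0,), (5.0,)]
neighbours = [{1}, {0}, {1}]
A = adjacency(points, neighbours, sigma=3.0)
```

Under the "either direction" rule the matrix is symmetric even though the nearest-neighbor relation itself is not.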
(4) Compute the density $\mathrm{den}(x_i)$ of each feature vector in $X=\{x_1, x_2, \ldots, x_m\}$ by the following formula:
[density formula: not reproduced in the source text]
Take the 96th-percentile value of all feature vector densities $\{\mathrm{den}(x_i)\}_{i=1,\ldots,m}$ and denote it as the threshold $T_{96}$.
(5) Choose the seed set $X_{seeds}$, where $X_{seeds}=\{x_i \mid \mathrm{den}(x_i)>T_{96},\ x_i\in X\}$.
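Steps (4) and (5) can be illustrated as follows. This is only a sketch under stated assumptions: the density formula is missing from this text, so the density of a vector is taken here to be the sum of its row in the adjacency matrix, and the percentile is computed with the nearest-rank method. Both choices, the function names, and the toy matrix are assumptions for illustration.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def select_seeds(A, pct=96):
    # ASSUMPTION: density of x_i is the sum of row i of the adjacency matrix.
    density = [sum(row) for row in A]
    threshold = percentile_nearest_rank(density, pct)
    # Seeds are the vectors whose density strictly exceeds the threshold.
    seeds = [i for i, d in enumerate(density) if d > threshold]
    return density, threshold, seeds

# Toy symmetric adjacency matrix for 4 points (hypothetical values).
A = [[0.0, 0.5, 0.5, 0.0],
     [0.5, 0.0, 0.25, 0.125],
     [0.5, 0.25, 0.0, 0.0],
     [0.0, 0.125, 0.0, 0.0]]
density, threshold, seeds = select_seeds(A, pct=50)
```

The toy call uses the 50th percentile because, with only four points and a strict inequality, the 96th percentile would leave no seeds; with m=1071 as in the embodiment, a 96th-percentile threshold keeps roughly the densest 4% of the vectors.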
(6) Cluster the current seed set $X_{seeds}$ with the mean shift algorithm (reference: D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24(5): 603-619) to obtain the class label set $L_{seeds}$ of the current seed set, where each element of $L_{seeds}$ is the class label of a feature vector in $X_{seeds}$; the labels are usually distinguished by natural numbers.
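To make step (6) concrete, here is a simplified flat-kernel mean shift. It is a stand-in for the algorithm of Comaniciu and Meer cited above, not a reproduction of it: the flat kernel, the rule that merges modes closer than half the bandwidth, and all names and toy data are assumptions for this sketch.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns one natural-number label per point."""
    modes = []
    for p in points:
        x = tuple(p)
        for _ in range(max_iter):
            # All data points inside the window centred on the estimate x.
            window = [q for q in points if math.dist(x, q) <= bandwidth]
            if not window:
                break
            new_x = tuple(sum(c) / len(window) for c in zip(*window))
            moved = math.dist(x, new_x)
            x = new_x
            if moved < tol:
                break
        modes.append(x)
    # Points whose modes (nearly) coincide receive the same label.
    centres, labels = [], []
    for mode in modes:
        for idx, centre in enumerate(centres):
            if math.dist(mode, centre) <= bandwidth / 2.0:
                labels.append(idx + 1)
                break
        else:
            centres.append(mode)
            labels.append(len(centres))
    return labels

# Two well-separated toy blobs (hypothetical data).
blobs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = mean_shift(blobs, bandwidth=1.0)
```

The number of classes is not specified in advance; it falls out of how many distinct modes the bandwidth resolves, which is why the bandwidth choice matters.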
(7) Incremental iterative clustering:
First, from the k-nearest-neighbor graph $G_k$ of step (2), select all k-nearest-neighbor data of the current seed set $X_{seeds}$, defined as $\Delta X=\{x_i \mid x_i\in N_k(x_j)\ \text{or}\ x_j\in N_k(x_i)\}$, where $x_j\in X_{seeds}$. Merge the current seed set $X_{seeds}$, which already has class labels, with ΔX, which has none, into a new seed set, denoted $X_{seeds}^{new}=X_{seeds}\cup\Delta X$.
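The seed-growing step just described can be sketched with plain sets of indices. The sketch assumes that points already in the seed set are excluded from ΔX, so that ΔX holds only unlabeled data; the function name and the toy neighbor sets are hypothetical.

```python
def expand_seeds(neighbours, seeds):
    """Return (delta, new_seeds) for one growth step.

    neighbours[i] is the set of k nearest neighbours of point i.
    ASSUMPTION: current seeds are removed from delta.
    """
    seeds = set(seeds)
    delta = set()
    for j in seeds:
        # x_i in N_k(x_j): points that a seed lists as a neighbour.
        delta |= neighbours[j]
    for i, nbrs in enumerate(neighbours):
        # x_j in N_k(x_i): points that list some seed as a neighbour.
        if nbrs & seeds:
            delta.add(i)
    delta -= seeds
    return delta, seeds | delta

# Hand-written 1-nearest-neighbour sets for 5 points (hypothetical).
neighbours = [{1}, {0}, {1}, {4}, {3}]
delta1, seeds1 = expand_seeds(neighbours, {0})
delta2, seeds2 = expand_seeds(neighbours, seeds1)
delta3, seeds3 = expand_seeds(neighbours, seeds2)
```

In this toy graph the seed set grows from {0} to {0, 1, 2} and then stalls, because points 3 and 4 are neighbors only of each other; data left unreached in this way are what step (8) handles.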
Secondly, apply the semi-supervised discriminant analysis method (reference: D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis", in Proc. IEEE Int. Conf. Computer Vision, Rio de Janeiro, Brazil, Jun. 2007) to $X_{seeds}^{new}$ to obtain a basis $U_{opt}$ of an optimal projection space, and project $X_{seeds}^{new}$ into the space spanned by $U_{opt}$. The projected data are $X_{seeds}^{SDA}=U_{opt}^{T}X_{seeds}^{new}$, where $U_{opt}^{T}$ denotes the transpose of the matrix $U_{opt}$.
Then, cluster $X_{seeds}^{SDA}$ with the mean shift algorithm and assign the resulting class labels to the corresponding data in $X_{seeds}^{new}$. Denote the class label set of $X_{seeds}^{new}$ as $L_{seeds}^{new}$.
Update $X_{seeds}$ and $L_{seeds}$ by letting $X_{seeds}=X_{seeds}^{new}$ and $L_{seeds}=L_{seeds}^{new}$.
Repeat the above process until
[stopping condition: formula not reproduced in the source text]
at which point the loop stops, yielding the final seed set $X_{seeds}$ and the clustering result $L_{seeds}$.
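The overall loop of step (7) can be outlined as a small driver that takes the projection and clustering stages as callbacks. This is a sketch under stated assumptions: the loop is taken to stop when growth adds no new points, and in the toy run the semi-supervised discriminant analysis projection is replaced by an identity mapping and mean shift by a simple threshold. All names are hypothetical.

```python
def grow_clusters(neighbours, seeds, seed_labels, project, cluster):
    """Incremental-iterative clustering driver.

    project(indices, labels) -> vectors, in the same order as indices
    cluster(vectors)         -> one label per vector
    ASSUMPTION: iteration stops when no unlabeled neighbour can be added.
    """
    seeds = list(seeds)
    labels = dict(seed_labels)
    while True:
        seed_set = set(seeds)
        delta = set()
        for j in seeds:
            delta |= neighbours[j]
        delta |= {i for i, n in enumerate(neighbours) if n & seed_set}
        delta -= seed_set
        if not delta:
            return seeds, labels
        new_seeds = seeds + sorted(delta)
        # Labeled seeds plus unlabeled delta are projected, then re-clustered.
        projected = project(new_seeds, labels)
        new_labels = cluster(projected)
        seeds = new_seeds
        labels = dict(zip(new_seeds, new_labels))

# Toy 1-D run with stand-in callbacks (hypothetical data).
points = [0.0, 1.0, 2.5, 10.0, 11.0, 12.5]
neighbours = [{1}, {0}, {1}, {4}, {3}, {4}]
identity = lambda indices, labels: [points[i] for i in indices]
threshold = lambda vectors: [1 if v < 5.0 else 2 for v in vectors]
final_seeds, final_labels = grow_clusters(
    neighbours, [0, 3], {0: 1, 3: 2}, identity, threshold)
```

Every point is relabeled on each pass, which matches the description above: the labels produced by clustering the projected data are assigned to all of $X_{seeds}^{new}$, not only to the newly added data.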
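(8) Obtain the class label set L of X to complete the clustering: if [condition: formula not reproduced in the source text], the data in $\{X-X_{seeds}\}$ are jointly given a new class label $l_{rest}$, and the class label set is [formula not reproduced in the source text], with $n\in\{1,\ldots,m\}$, where [formula not reproduced in the source text]; otherwise, the class label set is $L=L_{seeds}$.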
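Step (8) can be pictured as follows: any data never reached by seed growing share one extra label. The sketch assumes that labels are natural numbers and that $l_{rest}$ is chosen as one more than the largest label in use; that convention and the function name are assumptions, not taken from the patent.

```python
def finalize_labels(m, seed_labels):
    """Return a list of m labels, one per feature vector.

    seed_labels maps the index of each vector in the final seed set to its
    class label. Vectors outside the seed set share one new label.
    """
    if all(i in seed_labels for i in range(m)):
        # The seed set covers X, so L = L_seeds.
        return [seed_labels[i] for i in range(m)]
    # ASSUMPTION: the shared label is the next unused natural number.
    l_rest = max(seed_labels.values(), default=0) + 1
    return [seed_labels.get(i, l_rest) for i in range(m)]

partial = finalize_labels(5, {0: 1, 1: 1, 3: 2})
complete = finalize_labels(2, {0: 1, 1: 2})
```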
Fig. 6 shows the clustering result obtained by the algorithm of the invention, and Figs. 3-5 show the results obtained by the three classical clustering algorithms. The result of the invention is more consistent with the original class distribution in Fig. 2, which demonstrates the effectiveness of the invention.
In the invention, the feature bandwidth parameter $h_r$ of the mean shift algorithm is chosen according to the value range of the specific feature vectors; it is usually one order of magnitude smaller than that range. In the above embodiment, the chosen feature bandwidth parameter is $h_r=10.5$.
According to an exemplary embodiment of the invention, a computer system for implementing the invention may comprise, in particular, a central processing unit (CPU), memory, and an input/output (I/O) interface. The computer system is usually connected through the I/O interface to a display and to various input devices such as a mouse and keyboard. Support circuits may include circuits such as cache, power supplies, clock circuits, and a communication bus. The memory may include random access memory (RAM), read-only memory (ROM), a disk drive, a tape drive, etc., or a combination thereof. The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may be part of the microinstruction code or part of an application program (or a combination thereof) executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform, such as an additional data storage device and a printing device.
It should also be understood that, because some of the constituent system components and method steps depicted in the drawings may be implemented in software, the actual connections between the system components (or process steps) may differ depending on how the invention is programmed. Based on the principles of the invention presented herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the invention.
Claims (1)
1. A clustering method for multidimensional feature vectors, comprising the following steps:
(1) Denote the m feature vectors to be clustered as the feature vector set $X=\{x_1, x_2, \ldots, x_m\}$, where $x_i$ is a feature vector, $i=1,\ldots,m$;
(2) Build a k-nearest-neighbor graph $G_k$ on the feature vector set X, using the Euclidean distance $\|x_i-x_j\|_2$ to measure how near or far any two feature vectors $x_i$ and $x_j$ in X are from each other;
(3) Obtain the adjacency matrix A of the k-nearest-neighbor graph $G_k$ from step (2), where each element $A_{ij}$ of A is computed by formula (1):
[formula (1): not reproduced in the source text]
where $\mathrm{aff}_{(ij)}$ is the degree of adjacency between feature vectors $x_i$ and $x_j$, $N_k(x_j)$ denotes the k nearest neighbors of $x_j$, and $N_k(x_i)$ denotes the k nearest neighbors of $x_i$; $\mathrm{aff}_{(ij)}$ is computed by formula (2):
[formula (2): not reproduced in the source text]
where σ is a constant;
(4) Compute the density $\mathrm{den}(x_i)$ of each feature vector in $X=\{x_1, x_2, \ldots, x_m\}$ by the following formula:
[density formula: not reproduced in the source text]
Take the 96th-percentile value of all feature vector densities $\{\mathrm{den}(x_i)\}_{i=1,\ldots,m}$ and denote it as the threshold $T_{96}$;
(5) Obtain the seed set $X_{seeds}$, where $X_{seeds}=\{x_i \mid \mathrm{den}(x_i)>T_{96},\ x_i\in X\}$;
(6) Cluster the current seed set $X_{seeds}$ with the mean shift algorithm to obtain the class label set $L_{seeds}$ of the current seed set, where each element of $L_{seeds}$ is the class label of a feature vector in $X_{seeds}$; the labels are usually distinguished by natural numbers;
(7) Perform incremental iterative clustering on the current seed set $X_{seeds}$:
First, from the k-nearest-neighbor graph $G_k$ of step (2), select all k-nearest-neighbor data ΔX of the current seed set $X_{seeds}$, defined as $\Delta X=\{x_i \mid x_i\in N_k(x_j)\ \text{or}\ x_j\in N_k(x_i)\}$, where $x_j\in X_{seeds}$; merge the current seed set $X_{seeds}$ and ΔX into a new seed set, denoted $X_{seeds}^{new}=X_{seeds}\cup\Delta X$;
Secondly, within the new seed set $X_{seeds}^{new}$, regard the current seed set $X_{seeds}$ as data with class labels and ΔX as data without class labels; apply the semi-supervised discriminant analysis method to $X_{seeds}^{new}$ to obtain a basis $U_{opt}$ of an optimal projection space, and project $X_{seeds}^{new}$ into the space spanned by $U_{opt}$; the projected data are $X_{seeds}^{SDA}=U_{opt}^{T}X_{seeds}^{new}$, where $U_{opt}^{T}$ denotes the transpose of the matrix $U_{opt}$;
Then, cluster $X_{seeds}^{SDA}$ with the mean shift algorithm and assign the resulting class labels to the corresponding data in $X_{seeds}^{new}$; denote the class label set of $X_{seeds}^{new}$ as $L_{seeds}^{new}$, then update $X_{seeds}$ and $L_{seeds}$, i.e. let $X_{seeds}=X_{seeds}^{new}$ and $L_{seeds}=L_{seeds}^{new}$;
Repeat the above process until
[stopping condition: formula not reproduced in the source text]
at which point the loop stops, yielding the updated current seed set $X_{seeds}$ and the clustering result $L_{seeds}$, where $\emptyset$ is the empty set.
(8) Obtain the class label set L of X to complete the clustering:
[formula: not reproduced in the source text]
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010114138 CN101777126A (en) | 2010-02-10 | 2010-02-10 | Clustering method for multidimensional characteristic vectors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010114138 CN101777126A (en) | 2010-02-10 | 2010-02-10 | Clustering method for multidimensional characteristic vectors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101777126A true CN101777126A (en) | 2010-07-14 |
Family
ID=42513584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201010114138 Pending CN101777126A (en) | 2010-02-10 | 2010-02-10 | Clustering method for multidimensional characteristic vectors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101777126A (en) |
Non-Patent Citations (3)
Title |
---|
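《IEEE》, 2008-12-31, Rui Huang et al., "Approximation of Salient Contours in cluttered scenes", full text, 1, 2 * |
《IEEE》, 2009-12-31, Rui Huang et al., "Segmentation via Incremental Transductive Learning", Sections II-III, 1, 2 * |
《计算机应用》 (Journal of Computer Applications), 2005-03-31, Yan Chengxin et al., "Image histogram clustering segmentation based on graph partitioning" (基于图划分的图像直方图聚类分割), full text, 1, Vol. 25, No. 3, 2 * |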
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543536A (en) * | 2018-10-23 | 2019-03-29 | 北京市商汤科技开发有限公司 | Image identification method and device, electronic equipment and storage medium |
CN109543536B (en) * | 2018-10-23 | 2020-11-10 | 北京市商汤科技开发有限公司 | Image identification method and device, electronic equipment and storage medium |
CN111563467A (en) * | 2020-05-13 | 2020-08-21 | 金陵科技学院 | Solar panel cleaning system based on machine vision |
CN111652148A (en) * | 2020-06-04 | 2020-09-11 | 航天科工智慧产业发展有限公司 | Face recognition method and device and electronic equipment |
CN113850281A (en) * | 2021-02-05 | 2021-12-28 | 天翼智慧家庭科技有限公司 | Data processing method and device based on MEANSHIFT optimization |
CN113850281B (en) * | 2021-02-05 | 2024-03-12 | 天翼数字生活科技有限公司 | MEANSHIFT optimization-based data processing method and device |
CN116628536A (en) * | 2023-07-26 | 2023-08-22 | 杭州易靓好车互联网科技有限公司 | Online transaction data processing system of automobile |
CN116628536B (en) * | 2023-07-26 | 2023-10-31 | 杭州易靓好车互联网科技有限公司 | Online transaction data processing system of automobile |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20100714 |