CN101777126A

CN101777126A - Clustering method for multidimensional characteristic vectors

Info

Publication number: CN101777126A
Application number: CN 201010114138
Authority: CN
Inventors: 黄锐; 桑农; 唐奇伶; 高俊; 高常鑫
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2010-02-10
Filing date: 2010-02-10
Publication date: 2010-07-14

Abstract

The invention discloses a clustering method for multidimensional feature vectors. Based on observations of how data are distributed in feature space, the method starts from the data in high-density regions, which are the easiest to cluster and yield spatially coherent results, and provides a clustering method that operates in an incremental, iterative mode. Each iteration selects the data of higher density as a seed set and organizes the data through a seed-growing process, so that every clustering step of the iteration is carried out on the data of relatively highest density. Results show that the clustering method of the invention can produce good results that classical clustering algorithms cannot obtain.

Description

A clustering method for multidimensional feature vectors

Technical field

The invention belongs to the field of pattern recognition and specifically relates to a method for clustering multidimensional feature vectors.

Background art

Clustering sample data in a feature space is an important information-processing tool in fields such as pattern recognition, computer vision, and data mining. Clustering not only reduces the amount of data that must be processed; the clustering result also reveals similarity patterns among the data. A robust clustering method should partition the data points in the feature space into a number of disjoint subsets (each subset being regarded as a class), such that the distance between data points in the same subset (class) is as small as possible and the distance between data points in different subsets (classes) is as large as possible. The present invention calls this robustness property spatial coherence (spatially coherent).

At present, classical clustering methods include the K-means clustering algorithm (reference: J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proc. Fifth Berkeley Symp. Math. Statistics and Probability, 1967: 281-297), the Normalized Cut clustering algorithm (reference: J. Shi and J. Malik, "Normalized cuts and image segmentation", IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22(8): 888-905), and the mean shift clustering algorithm (reference: D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24(5): 603-619). Given the feature vectors to be clustered, the K-means and Normalized Cut algorithms usually produce a clustering result once the desired number of classes is specified. The mean shift algorithm instead requires a feature bandwidth parameter (feature bandwidth); with this parameter it uses nonparametric density estimation to seek continuous local high-density regions in the feature space, and it assigns the feature vectors that belong to the same local high-density region to the same class.

If the data to be clustered exhibit a scattered blob-like distribution in the feature space, that is, every data point lies in one of several high-density regions (blobs) and the data points between these regions are very sparse (the density is discontinuous between high-density regions), then the classical clustering algorithms above can effectively output a spatially coherent clustering result (in fact, data with a scattered blob-like distribution are themselves spatially coherent). In practical applications, however, the data set to be clustered often does not exhibit a scattered blob-like distribution in the feature space; for example, in computer vision the feature vectors extracted from images often follow a complicated manifold distribution. Applying the classical clustering methods above directly to such data often fails to produce a spatially coherent result. One of the most important reasons is that, in real data with such complex distributions, there is often no clear boundary between high-density regions: data points of relatively low density still lie between the high-density regions, and these low-density data points are not sparse enough. In pattern classification and machine learning, researchers have reached a consensus that the uncertainty of a clustering result tends to arise at data points in the low-density regions of the feature space.

In addition, when feature vectors are difficult to cluster well in the original feature space, transforming them into a new feature space and clustering them there is a good alternative. The prior art includes a semi-supervised discriminant analysis algorithm (reference: D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis", in Proc. IEEE Int. Conf. Computer Vision, Rio de Janeiro, Brazil, Jun. 2007). After the original feature space is transformed with this algorithm, the feature vectors often have good separability in the new feature space, which greatly facilitates clustering them.

Summary of the invention

The object of the present invention is to provide a clustering method for multidimensional feature vectors whose clustering result is more spatially coherent, that is, more robust, and which therefore describes the class membership of the feature vectors more objectively.

The multidimensional feature vector clustering method provided by the invention comprises the following steps:

(1) The m feature vectors to be clustered are denoted as the feature vector set X = {x_1, x_2, ..., x_m}, where x_i is a feature vector, i = 1, ..., m;

(2) A k-nearest-neighbor graph G_k is built for the feature vector set X, where the Euclidean distance ‖x_i − x_j‖_2 measures how far apart any two feature vectors x_i and x_j in X are;

(3) The adjacency matrix A of the k-nearest-neighbor graph G_k from step (2) is obtained, where each element A_ij of A is computed by the following formula (1):

A_{ij} = \begin{cases} \mathrm{aff}_{ij}, & x_i \in N_k(x_j) \ \text{or} \ x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases} \qquad (1)

where aff_ij is the degree of adjacency between feature vector x_i and feature vector x_j, N_k(x_j) denotes the k nearest neighbors of x_j, and N_k(x_i) denotes the k nearest neighbors of x_i; aff_ij is computed by the following formula (2):

\mathrm{aff}_{ij} = \exp\left( -\frac{\| x_i - x_j \|_2}{2 \sigma^2} \right) \qquad (2)

where σ is a constant;

(4) The density den(x_i) of each feature vector in X = {x_1, x_2, ..., x_m} is computed as follows:

\mathrm{den}(x_i) = \sum_{j=1}^{m} A_{ij} \qquad (3)

The 96th percentile (96th-percentile) of all feature vector densities {den(x_i)}, i = 1, ..., m, is taken as the threshold T_96;

(5) The seed set X_seeds is obtained, where X_seeds = {x_i | den(x_i) > T_96, x_i ∈ X};

(6) The mean shift algorithm is used to cluster the current seed set X_seeds, yielding the class label set L_seeds of the current seed set, where each element of L_seeds is the class label of a feature vector in the current seed set X_seeds; the labels are usually distinguished by natural numbers;

(7) Incremental iterative clustering is performed on the current seed set X_seeds:

First, all k-nearest-neighbor data ΔX of the current seed set X_seeds are selected from the k-nearest-neighbor graph G_k of step (2), defined as ΔX = {x_i | x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i), where x_j ∈ X_seeds}; the current seed set X_seeds and ΔX are merged into a new seed set, denoted X_seeds^new = X_seeds ∪ ΔX;

Next, within the new seed set X_seeds^new, the current seed set X_seeds is regarded as data with class labels and ΔX as data without class labels; semi-supervised discriminant analysis is applied to X_seeds^new to obtain a basis U_opt of an optimal projection space, and X_seeds^new is projected into the space spanned by U_opt; the projected data are:

X_{seeds}^{SDA} = U_{opt}^{T} X_{seeds}^{new}

where U_opt^T denotes the transpose of the matrix U_opt;

Then, the mean shift algorithm is used to cluster X_seeds^SDA, and the resulting class labels are assigned to the corresponding data in X_seeds^new; the class label set of X_seeds^new is denoted L_seeds^new, and X_seeds and L_seeds are updated by setting X_seeds = X_seeds^new and L_seeds = L_seeds^new;

The above process is repeated until ΔX = ∅, at which point the loop terminates and yields the updated current seed set X_seeds and the clustering result L_seeds, where ∅ denotes the empty set.

(8) The class label set L of X is obtained, completing the clustering:

If X − X_seeds ≠ ∅, the data in {X − X_seeds} are jointly given a new class label l_rest, and the class label set L consists of L_seeds together with the label l_n = l_rest for every n ∈ {1, ..., m} such that x_n ∈ {X − X_seeds};

Otherwise, the class label set is L = L_seeds.
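Steps (2) to (5) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the patented implementation: the function names `knn_adjacency` and `select_seeds` are hypothetical, distances are computed by brute force, and formula (1) is assumed to set A_ij = aff_ij whenever either vector is among the other's k nearest neighbors and 0 otherwise. The defaults k = 7 and σ = 3 are the values quoted for the embodiment.

```python
import numpy as np

def knn_adjacency(X, k=7, sigma=3.0):
    """Adjacency matrix A of the k-nearest-neighbor graph (steps 2 and 3)."""
    m = X.shape[0]
    # Pairwise Euclidean distances ||x_i - x_j||_2.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k nearest neighbors of each point, excluding the point itself.
    masked = dist.copy()
    np.fill_diagonal(masked, np.inf)
    nn = np.argsort(masked, axis=1)[:, :k]
    is_nn = np.zeros((m, m), dtype=bool)
    is_nn[np.arange(m)[:, None], nn] = True      # is_nn[i, j]: x_j in N_k(x_i)
    # Assumed reading of formula (1): keep the pair if either is a neighbor of the other.
    linked = is_nn | is_nn.T
    # Formula (2) as written in the text: the distance is not squared.
    aff = np.exp(-dist / (2.0 * sigma ** 2))
    return np.where(linked, aff, 0.0)

def select_seeds(A, percentile=96):
    """Densities (formula 3), threshold T_96 and seed indices (steps 4 and 5)."""
    density = A.sum(axis=1)
    threshold = np.percentile(density, percentile)
    seeds = np.flatnonzero(density > threshold)
    return seeds, density, threshold

# Example: 200 random 3-dimensional feature vectors with values in 0-255.
rng = np.random.default_rng(0)
points = rng.uniform(0, 255, size=(200, 3))
A = knn_adjacency(points, k=7, sigma=3.0)
seeds, density, t96 = select_seeds(A)
```

Because the neighbor relation is symmetrized, a vector can end up with more than k non-zero entries in its row of A, which is what lets vectors in crowded regions accumulate a higher density.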

Brief description of the drawings

Fig. 1 is the flowchart of the method of the invention.

Fig. 2 shows 1071 three-dimensional feature vectors, in which markers of 3 colors represent 3 classes; the clustering result of an effective clustering algorithm should agree with the distribution of the three classes in Fig. 2.

Fig. 3 is the clustering result of the K-means algorithm.

Fig. 4 is the clustering result of the Normalized Cut algorithm.

Fig. 5 is the clustering result of the mean shift algorithm.

Fig. 6 is the clustering result of the clustering algorithm of the invention.

Embodiment

The present invention is explained in further detail below with reference to the drawings and a specific embodiment.

In this embodiment, as shown in Fig. 1, the detailed procedure is:

(1) The m feature vectors to be clustered are denoted as the feature vector set X = {x_1, x_2, ..., x_m}, where x_i is a feature vector, i = 1, ..., m. The number m of feature vectors is usually on the order of 10^2 to 10^3. Fig. 2 shows the spatial distribution for m = 1071, where each x_i is a 3-dimensional feature vector.

(2) A k-nearest-neighbor graph G_k is built for the feature vector set X; k is generally 5 to 7, and k = 7 in this embodiment. When G_k is built, the Euclidean distance ‖x_i − x_j‖_2 measures how far apart any two feature vectors x_i and x_j in X are.

(3) The adjacency matrix A of the k-nearest-neighbor graph G_k from step (2) is obtained, where each element A_ij of A is computed by formula (1):

A_{ij} = \begin{cases} \mathrm{aff}_{ij}, & x_i \in N_k(x_j) \ \text{or} \ x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases} \qquad (1)

where aff_ij is the degree of adjacency between feature vector x_i and feature vector x_j, N_k(x_j) denotes the k nearest neighbors of x_j, and N_k(x_i) denotes the k nearest neighbors of x_i. aff_ij is computed by formula (2):

\mathrm{aff}_{ij} = \exp\left( -\frac{\| x_i - x_j \|_2}{2 \sigma^2} \right) \qquad (2)

σ is a constant whose value depends on the type of feature vector; it is usually two orders of magnitude smaller than the value range of the feature vectors. In this embodiment, the feature vectors x_i take values in the range 0-255, and σ = 3 is chosen.

(4) The density den(x_i) of each feature vector in X = {x_1, x_2, ..., x_m} is computed by formula (3):

\mathrm{den}(x_i) = \sum_{j=1}^{m} A_{ij} \qquad (3)

The 96th percentile (96th-percentile) of all feature vector densities {den(x_i)}, i = 1, ..., m, is taken as the threshold T_96.

(5) The seed set X_seeds is chosen, where X_seeds = {x_i | den(x_i) > T_96, x_i ∈ X}.

(6) The mean shift algorithm (reference: D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24(5): 603-619) is used to cluster the current seed set X_seeds, yielding the class label set L_seeds of the current seed set, where each element of L_seeds is the class label of a feature vector in the current seed set X_seeds; the labels are usually distinguished by natural numbers.

(7) Incremental iterative clustering:

First, all k-nearest-neighbor data of the current seed set X_seeds are selected from the k-nearest-neighbor graph G_k of step (2), defined as ΔX = {x_i | x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i), where x_j ∈ X_seeds}; the current seed set X_seeds, which already has class labels, and ΔX, which has no class labels, are merged into a new seed set, denoted X_seeds^new = X_seeds ∪ ΔX.

Next, semi-supervised discriminant analysis (reference: D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis", in Proc. IEEE Int. Conf. Computer Vision, Rio de Janeiro, Brazil, Jun. 2007) is applied to X_seeds^new to obtain a basis U_opt of an optimal projection space, and X_seeds^new is projected into the space spanned by U_opt; the projected data are:

X_{seeds}^{SDA} = U_{opt}^{T} X_{seeds}^{new}

where U_opt^T denotes the transpose of the matrix U_opt.

Then, the mean shift algorithm is used to cluster X_seeds^SDA, and the resulting class labels are assigned to the corresponding data in X_seeds^new. The class label set of X_seeds^new is denoted L_seeds^new.

X_seeds and L_seeds are updated by setting

X_{seeds} = X_{seeds}^{new},

L_{seeds} = L_{seeds}^{new}.

The above process is repeated until ΔX = ∅, at which point the loop terminates and yields the final seed set X_seeds and the clustering result L_seeds.

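(8) The class label set L of X is obtained, completing the clustering: if X − X_seeds ≠ ∅, the data in {X − X_seeds} are jointly given a new class label l_rest, and the class label set L consists of L_seeds together with the label l_n = l_rest for every n ∈ {1, ..., m} such that x_n ∈ {X − X_seeds};

Otherwise, the class label set is L = L_seeds.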
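The seed-growing loop of steps (6) to (8) can be sketched as follows. This is a hedged illustration, not the patented implementation: `grow_seeds`, `cluster_fn`, and `project_fn` are hypothetical names, the mean shift clustering and the semi-supervised discriminant analysis projection are passed in as callables rather than implemented, and ΔX is taken to contain only neighbors that are not yet in the seed set, which is an assumption that makes the stopping test well defined. The toy stand-ins in the example (a threshold in place of mean shift, an identity map in place of the projection) are there only to make the sketch runnable.

```python
import numpy as np

def grow_seeds(X, neighbors, seeds, cluster_fn, project_fn):
    """Grow the seed set along the neighbor graph and return one label per row of X.

    neighbors  : (m, m) boolean matrix, True where two vectors are linked in G_k
    seeds      : indices of the initial seed set
    cluster_fn : maps an (n, d) array to n integer labels (stand-in for mean shift)
    project_fn : maps (data, partial_labels) to projected data (stand-in for the
                 semi-supervised projection); unlabeled rows carry the label -1
    """
    m = X.shape[0]
    in_seeds = np.zeros(m, dtype=bool)
    in_seeds[seeds] = True
    labels = np.full(m, -1)
    labels[in_seeds] = cluster_fn(X[in_seeds])           # step (6)

    while True:
        # Delta X: vectors linked to a seed, assumed here to exclude current seeds.
        delta = neighbors[in_seeds].any(axis=0) & ~in_seeds
        if not delta.any():                              # nothing left to add
            break
        new = in_seeds | delta                           # merged seed set
        projected = project_fn(X[new], labels[new])      # seeds labeled, Delta X = -1
        labels[new] = cluster_fn(projected)              # re-cluster the merged set
        in_seeds = new                                   # update the seed set

    # Step (8): vectors never reached share one new label.
    if not in_seeds.all():
        labels[~in_seeds] = labels[in_seeds].max() + 1
    return labels

# Toy example: two chains of linked points and one isolated point.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [100.0]])
neighbors = np.zeros((7, 7), dtype=bool)
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    neighbors[i, j] = neighbors[j, i] = True

labels = grow_seeds(
    X, neighbors, seeds=[1, 4],
    cluster_fn=lambda Z: (Z[:, 0] > 5).astype(int),     # toy stand-in clustering
    project_fn=lambda Z, partial: Z,                     # identity stand-in projection
)
```

Passing the clustering and projection steps as callables keeps the loop itself independent of any particular mean shift or discriminant analysis implementation.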

Fig. 6 shows the clustering result obtained by the algorithm of the invention, and Figs. 3-5 show the clustering results obtained by the three classical clustering algorithms. The result of the invention agrees more closely with the original class distribution in Fig. 2, which demonstrates the effectiveness of the invention.

In the invention, the feature bandwidth (feature bandwidth) parameter h_r of the mean shift algorithm is chosen according to the numerical range of the specific feature vectors; it is usually one order of magnitude smaller than the value range of the feature vectors. In the above embodiment, the chosen feature bandwidth parameter is h_r = 10.5.
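To make the role of the bandwidth concrete, the following is a minimal flat-kernel mean shift in NumPy. It is a simplified sketch, not the procedure of the cited Comaniciu and Meer paper: the function name `mean_shift`, the flat (uniform) kernel, the iteration limit, and the rule that merges modes closer than one bandwidth are all assumptions chosen for brevity. The bandwidth 10.5 in the example is the h_r value of the embodiment.

```python
import numpy as np

def mean_shift(X, bandwidth, n_iter=50, tol=1e-3):
    """Flat-kernel mean shift; returns one integer class label per row of X."""
    X = np.asarray(X, dtype=float)
    modes = X.copy()                       # every vector starts as its own mode estimate
    for _ in range(n_iter):
        # Distance from each current estimate to each data vector.
        dist = np.linalg.norm(modes[:, None, :] - X[None, :, :], axis=-1)
        within = (dist <= bandwidth).astype(float)
        # Move each estimate to the mean of the data inside its window.
        shifted = within @ X / within.sum(axis=1, keepdims=True)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    # Vectors whose estimates ended up within one bandwidth share a label.
    labels = np.full(len(X), -1)
    centers = []
    for i, mode in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth:
                labels[i] = c
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return labels

# Example: two compact groups of 3-dimensional vectors, clustered with h_r = 10.5.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50.0, scale=0.5, size=(30, 3))
group_b = rng.normal(loc=200.0, scale=0.5, size=(30, 3))
blob_labels = mean_shift(np.vstack([group_a, group_b]), bandwidth=10.5)
```

A larger bandwidth merges nearby high-density regions into one class, while a smaller one splits them, which is why the text ties h_r to the value range of the feature vectors.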

According to an exemplary embodiment of the invention, a computer system for implementing the invention may comprise, in particular, a central processing unit (CPU), memory, and an input/output (I/O) interface. The computer system is usually connected through the I/O interface to a display and to various input devices such as a mouse and keyboard; support circuits may include circuits such as cache, power supplies, clock circuits, and a communication bus. The memory may include random access memory (RAM), read-only memory (ROM), a disk drive, a tape drive, and so on, or a combination of these. The computer platform also includes an operating system and microinstruction code. The various processes and functions described here may be part of the microinstruction code or part of an application program (or a combination of these) executed through the operating system. In addition, various other peripheral devices, such as additional data storage devices and printing devices, may be connected to the computer platform.

It should also be understood that, because some of the system components and method steps depicted in the drawings may be implemented in software, the actual connections between the system components (or process steps) may differ depending on how the invention is programmed. Based on the principles of the invention set out here, a person of ordinary skill in the relevant art can contemplate these and similar implementations or configurations of the invention.

Claims

1. A clustering method for multidimensional feature vectors, comprising the following steps:

aff_ij is the degree of adjacency between feature vector x_i and feature vector x_j, N_k(x_j) denotes the k nearest neighbors of x_j, and N_k(x_i) denotes the k nearest neighbors of x_i, where aff_ij is computed by the following formula (2):

\mathrm{aff}_{ij} = \exp\left( -\frac{\| x_i - x_j \|_2}{2 \sigma^2} \right) \qquad (2)

where σ is a constant;

\mathrm{den}(x_i) = \sum_{j=1}^{m} A_{ij} \qquad (3)

(7) Incremental iterative clustering is performed on the current seed set X_seeds:

Next, within the new seed set X_seeds^new, the current seed set X_seeds is regarded as data with class labels and ΔX as data without class labels; semi-supervised discriminant analysis is applied to X_seeds^new to obtain a basis U_opt of an optimal projection space, and X_seeds^new is projected into the space spanned by U_opt; the projected data are X_seeds^SDA = U_opt^T X_seeds^new, where U_opt^T denotes the transpose of the matrix U_opt;

Then, the mean shift algorithm is used to cluster X_seeds^SDA, and the resulting class labels are assigned to the corresponding data in X_seeds^new; the class label set of X_seeds^new is denoted L_seeds^new, and X_seeds and L_seeds are then updated by setting X_seeds = X_seeds^new and L_seeds = L_seeds^new;

The above process is repeated until ΔX = ∅, at which point the loop terminates and yields the updated current seed set X_seeds and the clustering result L_seeds, where ∅ denotes the empty set.

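(8) The class label set L of X is obtained, completing the clustering:

If X − X_seeds ≠ ∅, the data in {X − X_seeds} are jointly given a new class label l_rest, and the class label set L consists of L_seeds together with the label l_n = l_rest for every n ∈ {1, ..., m} such that x_n ∈ {X − X_seeds};

Otherwise, the class label set is L = L_seeds.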