CN102426631A - High-dimension space mapping-based K harmonic mean clustering method

Info

Publication number: CN102426631A
Application number: CN 201110341012
Authority: CN
Inventors: 王建宇; 康其桔; 马鹏飞; 孙丽娟; 陆源; 何新; 王凯; 田乃鲁
Original assignee: Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Current assignee: Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority date: 2011-11-01
Filing date: 2011-11-01
Publication date: 2012-04-25

Abstract

The invention discloses a K harmonic mean clustering method based on high-dimensional space mapping. Assuming the sample data are already in space-vector form, the method maps the space-vector data to a higher-dimensional space and then introduces the K harmonic mean to cluster the data. The method comprises the following steps: (1) processing the data; (2) selecting the initial cluster centers of the data; (3) mapping the distance measure to the high-dimensional space; (4) substituting the mapped distance measure into the calculation of the harmonic distance of each data sample; (5) performing K-means clustering with the harmonic distance as the distance measure; and (6) outputting the result. The method effectively reduces the sensitivity of the conventional K-means algorithm to its initial values and greatly reduces the clustering errors caused by data aliasing.

Description

A K harmonic mean clustering method based on high-dimensional space mapping

Technical field

The present invention relates to the fields of computational science and intelligent information processing, in particular to techniques for clustering data sets, and specifically to a K harmonic mean clustering method based on high-dimensional space mapping.

Background technology

Cluster analysis, as a data preprocessing method, is the basis for further analysis and processing of data, and it has become an indispensable tool for handling large-scale data. At present, the most commonly used data clustering method is K-means clustering. Experiments show that although this method meets the clustering needs of intelligent information processing to some extent, it is very sensitive to the randomness of the initial cluster centers and cannot solve the data aliasing problem found in practical engineering applications, so it is not suited to current large-scale, complex data clustering. There is therefore an urgent need for a clustering method that is insensitive to the initial cluster centers and can solve the data aliasing problem.

Summary of the invention

The object of the present invention is to provide a K harmonic mean clustering method based on high-dimensional space mapping that makes the clustering results for large-scale, complex data more stable and more accurate.

To achieve this object, the technical solution of the present invention is a K harmonic mean clustering method based on high-dimensional space mapping, comprising the following steps:

(1) processing the raw data into space-vector form, that is, each data sample exists as a multidimensional space vector;

(2) selecting the initial cluster centers of the data;

(3) mapping the distance measure to a high-dimensional space;

(4) substituting the mapped distance measure into the calculation of the harmonic distance of each sample point;

(5) performing K-means clustering with this harmonic distance as the distance measure;

(6) outputting the result.

To better discriminate, extract, and amplify useful features, and thereby cluster more accurately, the distance measure in step (3) above is the cosine of the included angle, and a Mercer kernel function is used to map this cosine value to the high-dimensional space.

The present invention has the following advantages:

The K harmonic mean clustering method based on high-dimensional space mapping, which the present invention designs for data clustering in complex settings, clusters point-like space-vector data stably and accurately and groups the data into their different classes. For the distance metric, a radial basis kernel function maps the cosine measure into a high-dimensional space for calculation, which effectively separates aliased data and offers a considerable advantage over the traditional cosine measure.

Description of drawings

The accompanying drawing is a flowchart of the method of the present invention.

Embodiment

The steps of the method are shown in the accompanying drawing. For clarity, a specific embodiment of the present invention is described below step by step.

(1) Data processing.

The method operates on data in space-vector form, the most widely used form in this field; that is, each data sample exists as a multidimensional space vector. Because most real data already appear as multidimensional space vectors, the specific data-processing method is not part of the present invention; this step merely states that the data used by the method should be in space-vector form.

(2) Selecting the initial cluster centers of the data.

The present invention belongs to the field of data clustering, so the expected number of classes K must be specified. For the expected number of classes K, K initial cluster centers are selected. Because the present invention is insensitive to the initial values, this embodiment randomly draws K data samples as the initial cluster centers. The set of cluster centers is denoted C_{l} = [C_{l1}, C_{l2}, ..., C_{lm}], where l is the iteration number of the cluster centers and C_{lm} is the cluster center of the m-th class after the l-th round of calculation.
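The random selection described in step (2) can be sketched as follows. This is a minimal illustration rather than the embodiment's actual implementation; the function name `init_centers`, the `seed` parameter, and the sample values are assumptions introduced only for the example.

```python
import random

def init_centers(samples, k, seed=None):
    """Pick k distinct samples at random as the initial cluster centers.

    `samples` is a list of equal-length numeric vectors; `seed` (hypothetical)
    only makes the draw repeatable.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no sample is picked twice.
    return [list(x) for x in rng.sample(samples, k)]

# Illustrative data (not from the patent): four 2-D sample vectors, K = 2.
samples = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
centers = init_centers(samples, k=2, seed=0)
```

Copying each chosen sample into a new list keeps later center updates from modifying the original data.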

(3) Mapping the distance measure to a high-dimensional space.

The distance measure of this embodiment is the cosine of the included angle, which is mapped with a Mercer kernel function. A Mercer kernel has the key property of mapping low-dimensional data nonlinearly into a high-dimensional space, which helps to discriminate, extract, and amplify useful features and thus yields more accurate clustering. Without loss of generality, this embodiment uses the Gaussian kernel, a typical Mercer kernel, for illustration. The distance measure between two data samples after mapping is given by formula (1):

d(l_{1}, l_{2}) = \exp\left(\frac{\cos^{2}(l_{1}, l_{2})}{\sigma^{2}}\right) \qquad (1)
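A minimal sketch of formula (1) follows, evaluated literally as printed. The function names `cosine` and `mapped_measure` and the default `sigma=1.0` are assumptions made for illustration; the text does not state a value for σ. Note that, as printed, the value lies between 1 (orthogonal vectors) and exp(1/σ²) (parallel vectors).

```python
import math

def cosine(u, v):
    """Cosine of the included angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mapped_measure(u, v, sigma=1.0):
    """Formula (1) as printed: exp(cos^2(u, v) / sigma^2).

    sigma is the Gaussian kernel width; its default here is an assumption.
    """
    c = cosine(u, v)
    return math.exp((c * c) / (sigma * sigma))

# Illustrative vectors (not from the patent).
d_orthogonal = mapped_measure([1.0, 0.0], [0.0, 1.0])  # cos = 0
d_parallel = mapped_measure([1.0, 0.0], [2.0, 0.0])    # cos = 1
```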

(4) Substituting the mapped distance measure into the calculation of the harmonic distance between sample points.

In the traditional K-means clustering method, the distance calculation takes the minimum distance between a data point and the cluster centers. In the present invention, the distance calculation instead uses the harmonic distance: the harmonic mean of the distances between a data point and all cluster centers replaces the traditional calculation, which introduces dynamic weighting and softens the hard clustering.
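The contrast between the minimum distance and the harmonic mean can be illustrated with a short sketch. The function name `harmonic_distance` and the example distances are assumptions chosen for the illustration; the harmonic mean of k positive distances is taken as k divided by the sum of their reciprocals.

```python
def harmonic_distance(dists):
    """Harmonic mean of a point's (strictly positive) distances to all centers."""
    if not dists or any(d <= 0 for d in dists):
        raise ValueError("distances must be a non-empty list of positive numbers")
    return len(dists) / sum(1.0 / d for d in dists)

# Illustrative distances from one sample to three cluster centers.
dists = [1.0, 2.0, 4.0]
hard = min(dists)                 # what traditional K-means uses
soft = harmonic_distance(dists)   # 3 / (1 + 0.5 + 0.25) = 12/7
```

Because every center contributes a reciprocal term, the harmonic mean is dominated by the nearest center yet still reflects the others; it always lies between the minimum distance and k times the minimum distance.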

(5) Performing K-means clustering with this harmonic distance as the distance measure.

With the above calculation, the update formula for the cluster center C_{l} of the l-th class (formula (2)) and the formula for the clustering objective function E_{KHM} (formula (3)) are, respectively:

C_{l} = \frac{\sum_{i=1}^{n} \frac{1}{\left(\sum_{j=1}^{k} \frac{d_{i,l}^{2}}{d_{i,j}^{2}}\right)^{2}} X_{i}}{\sum_{i=1}^{n} \frac{1}{\left(\sum_{j=1}^{k} \frac{d_{i,l}^{2}}{d_{i,j}^{2}}\right)^{2}}} \qquad (2)

E_{KHM} = \sum_{i=1}^{n} \frac{k}{\sum_{l=1}^{k} \frac{1}{d(X_{i}, C_{l})}} \qquad (3)

where X_{i} is the i-th sample point, d_{i,l} denotes d(X_{i}, C_{l}), and d in formulas (2) and (3) is computed by formula (1). The cluster centers are iterated repeatedly by formula (2) until the result of formula (3) is stable, at which point the clustering process ends.
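The iteration can be sketched as below, under two stated assumptions. First, formula (2) is read as a weighted mean of the samples X_i with weights 1/(Σ_j d_{i,l}²/d_{i,j}²)². Second, the distance is passed in as a function; the patent computes it with formula (1), whereas this example plugs in a plain Euclidean distance (`euclidean`, with a small floor `eps`) purely as a stand-in so the result is easy to check. All function names, `tol`, `max_iter`, and the data are hypothetical.

```python
import math

def euclidean(x, c, eps=1e-9):
    # Stand-in distance (an assumption); floored at eps to avoid division by zero.
    return max(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))), eps)

def khm_objective(samples, centers, dist):
    # Formula (3): sum over samples of the harmonic mean of distances to all centers.
    k = len(centers)
    return sum(k / sum(1.0 / dist(x, c) for c in centers) for x in samples)

def khm_update(samples, centers, dist):
    # Formula (2), read as a weighted mean of the samples (see assumptions above).
    new_centers = []
    for c_l in centers:
        num = [0.0] * len(c_l)
        den = 0.0
        for x in samples:
            d_l2 = dist(x, c_l) ** 2
            ratio_sum = sum(d_l2 / dist(x, c_j) ** 2 for c_j in centers)
            w = 1.0 / ratio_sum ** 2  # equals 1 / (d_il^4 * (sum_j 1/d_ij^2)^2)
            den += w
            num = [n + w * xi for n, xi in zip(num, x)]
        new_centers.append([n / den for n in num])
    return new_centers

def khm(samples, centers, dist=euclidean, tol=1e-6, max_iter=100):
    # Iterate formula (2) until the value of formula (3) stops changing.
    prev = cur = khm_objective(samples, centers, dist)
    for _ in range(max_iter):
        centers = khm_update(samples, centers, dist)
        cur = khm_objective(samples, centers, dist)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return centers, cur

# Illustrative data: two well-separated groups of three 2-D points each.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, objective = khm(samples, [[0.0, 0.0], [10.0, 10.0]])
```

In this example each center settles close to the mean of its own group, because samples near the other center receive weights close to zero. A measure built from formula (1) can be supplied through the `dist` parameter instead of the Euclidean stand-in.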

(6) Outputting the result.

In this field there are many ways to output results. The present invention does not concern the specific output form; it only specifies that this step is one of the necessary steps of the invention.

The above embodiment does not limit the present invention in any way; any technical solution obtained by equivalent replacement or equivalent transformation falls within the protection scope of the present invention.

Claims

1. A K harmonic mean clustering method based on high-dimensional space mapping, characterized by comprising the following steps:

(1) processing the raw data into space-vector form;

(2) selecting the initial cluster centers of the data;

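(3) mapping the distance measure to a high-dimensional space;

(4) substituting the mapped distance measure into the calculation of the harmonic distance of each sample point;

(5) performing K-means clustering with this harmonic distance as the distance measure;

(6) outputting the result.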

2. The K harmonic mean clustering method based on high-dimensional space mapping according to claim 1, characterized in that the distance measure in step (3) is the cosine of the included angle.

3. The K harmonic mean clustering method based on high-dimensional space mapping according to claim 2, characterized in that in step (3) a Mercer kernel function is used to map the cosine of the included angle to the high-dimensional space.