CN107704872A - A K-means cluster initial center selection method based on relatively most discrete dimension splitting - Google Patents

A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Info

Publication number
CN107704872A
CN107704872A CN201710844898.4A
Authority
CN
China
Prior art keywords
dimension
data
relatively
discrete
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710844898.4A
Other languages
Chinese (zh)
Inventor
吴造林
胡长俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN201710844898.4A priority Critical patent/CN107704872A/en
Publication of CN107704872A publication Critical patent/CN107704872A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for selecting the initial center points of K-means clustering based on splitting along the relatively most discrete dimension. The idea of the method is as follows: given a D-dimensional data set, s1. apply dimensionality reduction to the data set; s2. evaluate the degree of dispersion of each dimension of the reduced data set; s3. select the relatively most discrete dimension for splitting, and divide all data into two classes at that dimension's mean point; s4. choose the class containing the most data points among the classes produced by the split, select the relatively most discrete dimension again according to s2 and s3, and continue splitting at the mean point of that most discrete dimension, repeating the above steps until the required number of classes is obtained; s5. compute the mean of the data in each resulting class; s6. apply a dimension-raising operation to the mean of each class, and use the results as the initial center points for K-means clustering. The beneficial effects of the invention are: the reduced-dimension data lowers the amount of computation and speeds up the operation, so that K-means clustering can reach higher clustering accuracy with fewer iterations.

Description

A K-means cluster initial center selection method based on relatively most discrete dimension splitting
Technical field
The present invention relates to the field of data mining technology, and in particular to a K-means cluster initial center selection method based on splitting along the relatively most discrete dimension.
Background technology
The process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects is called clustering.
Clustering assigns a group of individuals to categories according to similarity, i.e., "birds of a feather flock together". Its purpose is to make the distance between individuals of the same category as small as possible, and the distance between individuals of different categories as large as possible. Each class is also called a cluster: the similarity of objects within a cluster is high, while the similarity of objects between clusters is low. According to this characteristic, clustering algorithms can be divided into partition-based, density-based, hierarchical, grid-based, and other families.
K-means is a classic partition-based clustering algorithm; because it is simple and effective, it is widely used in tasks such as data mining, machine learning, and pattern recognition.
K-means is a hard clustering algorithm and a typical representative of prototype-based objective-function clustering methods: it takes the distance from each data point to a prototype as the objective function to be optimized, and derives the iterative update rules by seeking an extremum of that function. K-means uses Euclidean distance as its similarity measure and seeks the optimal classification corresponding to some initial cluster center vector V such that the evaluation index J is minimized. The algorithm uses the sum-of-squared-errors criterion function as its clustering criterion function.
The basic principle of K-means is as follows:
Let the data set to be clustered be X = { x_i | x_i ∈ R^D, i = 1, 2, 3, ..., N };
the number of cluster classes is K, and the K cluster centers are C_1, C_2, ..., C_K.
Randomly select K initial centers from the N data objects.
Compute the distance between each object and each cluster center (mean point), and assign each object to the cluster with the smallest distance.
The distance between objects is defined as the Euclidean distance; for any two data points x_i and x_j, the Euclidean distance is d(x_i, x_j) = sqrt( Σ_{k=1}^{D} (x_ik - x_jk)^2 ).
Recompute the mean of each cluster.
Repeat the assignment and mean-update steps [0010]~[0012] until the objective function no longer changes.
The sum of the Euclidean distances from each object in the data set to the center of the cluster it belongs to is called the objective function, usually denoted by the letter J.
The above is the traditional K-means algorithm. Traditional K-means is easily affected by the initial center points: if the initial centers are chosen poorly, the number of iterations may increase, raising the amount of computation, and the clustering may even fall into a locally optimal solution and fail to reach the desired result.
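The traditional algorithm described above can be sketched as follows. This is a minimal illustrative implementation only; the function name `kmeans` and the use of NumPy are choices of this sketch, not part of the patent:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Traditional K-means (random initial centers), as described above."""
    rng = np.random.default_rng(seed)
    # randomly select K initial centers from the N data objects
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the mean of each cluster (keep old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # stop when the centers (and hence the objective) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The empty-cluster fallback is an assumption of this sketch; the patent's description does not address that edge case.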
Summary of the invention
An implementation of the present invention proposes a K-means initial center point selection method that can reduce the number of iterations of the K-means algorithm.
An implementation of the present invention proposes a K-means initial center point selection method with which the K-means clustering results achieve higher accuracy and do not fall into local optima.
In the traditional K-means algorithm, the initial cluster centers are assigned at random, which increases the number of iterations during clustering and can trap the algorithm in a local optimum. Therefore, the present invention designs an initial center point selection method as follows:
Given any data set U to be clustered (containing N D-dimensional data points) and a cluster number K:
First apply dimensionality reduction to the data set to obtain a data set X (containing N d-dimensional data points, d <= D). Linearly correlated dimensions may exist in the original data set; by transforming the linearly correlated dimensions into a linearly independent representation, the principal feature components of the data are extracted.
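The reduction described here, transforming linearly correlated dimensions into a linearly independent representation that extracts the principal feature components, matches principal component analysis, although the patent does not name a specific technique. The following PCA-style sketch is therefore one plausible reading; the class name `SimplePCA` and its interface are assumptions of this sketch. Its inverse transform is the kind of "dimension raising" the method applies to the centers later:

```python
import numpy as np

class SimplePCA:
    """PCA-style sketch: keep the top-d principal axes of the data."""
    def __init__(self, d):
        self.d = d  # target (reduced) dimensionality

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # right singular vectors of the centered data are the principal axes
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:self.d]  # d x D transformation matrix
        return self

    def transform(self, X):
        # D -> d: project onto the linearly independent principal axes
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, Y):
        # d -> D: the "dimension raising" back to the original space
        return Y @ self.components_ + self.mean_
```

When a dimension is an exact linear combination of others, the top-d axes capture the data exactly and the inverse transform reconstructs it without loss.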
Evaluate the degree of dispersion of each dimension of the data set X, i.e., compute the variance of the data in each dimension. Let x_ij denote the value of the j-th dimension of the i-th data point, and let X_j denote the values of all data points in the j-th dimension.
The degree of dispersion is computed as follows:
Var(X_j) = (1/N) * Σ_{i=1}^{N} (x_ij - mean(X_j))^2
where Var(·) denotes the variance. After computing the variance of each dimension of X, take the dimension with the largest variance as the most discrete dimension.
Denote the data of the most discrete dimension by S (S is an N × 1 vector).
Compute the mean of the vector S: Ms = mean(S).
Divide the data points whose value in S is greater than Ms into the first box, and the data points whose value in S is less than Ms into the second box.
Count the number of data points in each box, and select the box containing the most data points as the data set to be operated on next, denoted box_max.
Following [0020]-[0026], continue splitting box_max into two boxes, and repeat the steps above until the number of boxes equals the cluster number K.
Compute the mean of the data in each box, M_b = [M_b1, M_b2, ..., M_bK]^T, where M_b is a K × d matrix.
Using the dimension-reduction transform of [0019], raise M_b back to the dimensionality of the original data to obtain C_b = [C_b1, C_b2, ..., C_bK]^T, where C_b is a K × D matrix.
This C_b then serves as the initial centers for K-means clustering.
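The splitting procedure of [0020]-[0026] can be sketched as follows, operating directly on the (already reduced) N × d data. The function name `discrete_dim_init` is illustrative, and assigning points that fall exactly on the mean to the first box is an assumption of this sketch, since the patent leaves that case unspecified:

```python
import numpy as np

def discrete_dim_init(X, k):
    """Repeatedly split the largest box at the mean of its most discrete
    (highest-variance) dimension; return the K box means as initial centers."""
    boxes = [X]
    while len(boxes) < k:
        # select the box containing the most data points (box_max)
        i = max(range(len(boxes)), key=lambda j: len(boxes[j]))
        box = boxes.pop(i)
        dim = box.var(axis=0).argmax()   # relatively most discrete dimension
        ms = box[:, dim].mean()          # mean point Ms of that dimension
        # split the box into two at the mean point
        boxes.append(box[box[:, dim] <= ms])
        boxes.append(box[box[:, dim] > ms])
    # mean of the data in each box: the K x d matrix M_b
    return np.array([b.mean(axis=0) for b in boxes])
```

If the data was dimension-reduced beforehand, the returned K × d means would still need the inverse (dimension-raising) transform to give C_b in the original D-dimensional space.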
Brief description of the drawings
Fig. 1 is a flow chart of the K-means cluster initial center selection method based on relatively most discrete dimension splitting in an embodiment of the present invention.
Fig. 2 is a visualization of a certain two-dimensional data set.
Fig. 3 compares the initial center points selected by the initial center selection method of the present invention with the cluster centers after clustering: the small red circles represent the initial center points chosen by the method of the invention, and the small black squares represent the centers of each class after clustering.
Fig. 4 is the clustering result obtained from the initial centers selected by the method of the invention.
Embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
The embodiment and its flow are as follows:
For ease of visualization, two-dimensional data points are used as the implementation example: given a data set U (420 two-dimensional data points) and a cluster number of 4.
Because data set U is two-dimensional, no dimensionality reduction is performed; this does not affect the validity of the method for high-dimensional data after reduction.
Evaluate the relatively most discrete dimension of the two-dimensional data, and split the data set into two along that dimension.
Choose the one of the two resulting data sets that contains more data points, evaluate its relatively most discrete dimension, and split it again; at this point, 3 data sets are formed.
Continue in the same way: choose the data set containing the most points among the three, evaluate its relatively most discrete dimension, and split it into two. Four data sets are now formed, consistent with the number of clusters.
Compute the mean of each of these four data sets. Because no dimensionality reduction was applied to data set U at the start, no dimension-raising operation is needed here; the means of the four data sets are the K-means cluster initial centers. The initial centers chosen by the method of the invention are shown in Fig. 3.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of its protection.

Claims (4)

1. A K-means cluster center point initialization method, characterized in that the method comprises:
for any given D-dimensional data set containing N data points, transforming the data set into a representation in which the dimensions are linearly independent, usable for extracting the principal feature components of the data set, i.e., applying dimensionality reduction to the data set; then selecting the relatively most discrete dimension of the data set and splitting it at the mean point; after the required number of cluster classes is obtained by splitting, computing the mean of each resulting class and applying a dimension-raising operation, with the resulting data points serving as the initial center points for K-means clustering.
2. The K-means cluster center point initialization method according to claim 1, characterized in that the method of splitting according to the relatively most discrete dimension comprises:
evaluating the degree of dispersion of the data in each dimension; for the relatively most discrete dimension, taking the mean of that dimension as the threshold point, dividing the data set into two classes according to that dimension, and placing the data points of the two classes into two boxes respectively; then choosing from the two boxes the box with more data points as the next data set to be operated on, denoted box_max; for box_max, computing the variance of each dimension again, choosing the most discrete dimension by the method described above, and splitting box_max along that dimension into two boxes again; repeating the above procedure until the number of boxes equals the cluster number K.
3. The relatively most discrete splitting method according to claim 2, characterized in that the relatively-most-discrete selection comprises:
before splitting the data, normalizing the data in each dimension (because the dimensions may not be of the same order of magnitude); after normalization, computing the variance of each dimension, and selecting the dimension with the largest variance as the relatively most discrete dimension.
4. The K-means cluster center point initialization method according to claim 1, characterized in that the dimension reduction and dimension raising method comprises:
linearly correlated dimensions may exist in the original data set; by transforming the linearly correlated dimensions into a linearly independent representation, the principal feature components of the data are extracted; the transformation matrix used in the dimensionality reduction can later be used in a dimension-raising operation to restore the initial cluster center points obtained after reduction to the original dimensionality D.
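The normalize-then-take-largest-variance selection of claim 3 can be sketched as follows. Min-max normalization is an assumption of this sketch (the claim does not fix a particular normalization), and the function name `most_discrete_dim` is illustrative:

```python
import numpy as np

def most_discrete_dim(X):
    """Pick the relatively most discrete dimension: normalize each dimension
    first (dimensions may differ by orders of magnitude), then return the
    index of the dimension with the largest variance."""
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                   # avoid division by zero for constant dims
    Z = (X - X.min(axis=0)) / span          # per-dimension min-max normalization
    return Z.var(axis=0).argmax()
```

Without normalization, a large-scale but weakly structured dimension could dominate the raw variance comparison; normalizing first makes the comparison scale-free.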
CN201710844898.4A 2017-09-19 2017-09-19 A K-means cluster initial center selection method based on relatively most discrete dimension splitting Pending CN107704872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710844898.4A CN107704872A (en) 2017-09-19 2017-09-19 A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710844898.4A CN107704872A (en) 2017-09-19 2017-09-19 A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Publications (1)

Publication Number Publication Date
CN107704872A true CN107704872A (en) 2018-02-16

Family

ID=61172890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710844898.4A Pending CN107704872A (en) A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Country Status (1)

Country Link
CN (1) CN107704872A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717465A (en) * 2018-06-04 2018-10-30 哈尔滨工程大学 Subgroup based on user behavior analysis finds method
CN109271555A (en) * 2018-09-19 2019-01-25 上海哔哩哔哩科技有限公司 Information cluster method, system, server and computer readable storage medium
CN111737469A (en) * 2020-06-23 2020-10-02 中山大学 Data mining method and device, terminal equipment and readable storage medium
CN113780404A (en) * 2020-01-14 2021-12-10 支付宝(杭州)信息技术有限公司 Resource data processing method and device


Similar Documents

Publication Publication Date Title
CN107704872A (en) A K-means cluster initial center selection method based on relatively most discrete dimension splitting
CN109871860A (en) A daily load curve dimensionality-reduction clustering method based on kernel principal component analysis
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN107291895B (en) Quick hierarchical document query method
Pardeshi et al. Improved k-medoids clustering based on cluster validity index and object density
CN108846338A (en) Polarization characteristic selection and classification method based on object-oriented random forest
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN107316053A (en) A kind of cloth image Rapid matching search method
CN106156374A (en) An image retrieval method based on visual dictionary optimization and query expansion
JP4937395B2 (en) Feature vector generation apparatus, feature vector generation method and program
CN104346459A (en) Text classification feature selecting method based on term frequency and chi-square statistics
CN103778206A (en) Method for providing network service resources
Tidake et al. Multi-label classification: a survey
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
CN107391594A (en) A kind of image search method based on the sequence of iteration vision
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
CN111080351A (en) Clustering method and system for multi-dimensional data set
Mohammed et al. Weight-based firefly algorithm for document clustering
Vijay et al. Hamming distance based clustering algorithm
Jiang et al. A hybrid clustering algorithm
Devi et al. A proficient method for text clustering using harmony search method
Hu et al. A Novel clustering scheme based on density peaks and spectral analysis
Altintakan et al. An improved BOW approach using fuzzy feature encoding and visual-word weighting
CN107423379B (en) Image search method based on CNN feature words tree
Duchscherer Classifying Building Usages: A Machine Learning Approach on Building Extractions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180216

WD01 Invention patent application deemed withdrawn after publication