CN104731760A - K-means data processing method based on data density and Huffman tree - Google Patents


Info

Publication number
CN104731760A
CN104731760A
Authority
CN
China
Prior art keywords
point, data, points, density, data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510184419.1A
Other languages
Chinese (zh)
Inventor
邓燕妮
褚四勇
涂林丽
尉成勇
邓智斌
龚良文
赵东明
傅剑
刘小珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201510184419.1A priority Critical patent/CN104731760A/en
Publication of CN104731760A publication Critical patent/CN104731760A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a k-means data processing method based on data density and a Huffman tree. The method includes the steps of: 1, calculating the density of each point Pi in a given data set U; 2, calculating the average density Mav of the points Pi in the data set U; 3, putting the points whose density is greater than the average density Mav into a set V, and generating a Huffman tree from the set V; 4, deleting k-1 points from the generated Huffman tree in reverse order, to obtain k points; 5, taking the remaining k points as the initial clustering centers; 6, calculating the distances from all points of the data set U to the k centers, and assigning each point to the cluster closest to it; 7, adjusting each clustering center by moving it to the geometric center of its class; 8, repeating steps 6 and 7 until the clustering centers no longer move, thus acquiring the clustering result. The method has the advantage that classification accuracy is preserved while the running time of the algorithm is greatly shortened.

Description

K-means data processing method based on data density and Huffman tree
Technical field
The present invention relates to data processing methods, and in particular to a k-means data processing method based on data density and a Huffman tree.
Background technology
The k-means algorithm was first proposed by MacQueen in 1967. Its core idea is to divide the data set to be processed into k classes such that similarity within each class is highest and similarity between classes is lowest. Because the idea of the algorithm is clear, it converges quickly, and it scales well, it has been widely used; but its shortcomings are equally obvious: the algorithm is sensitive to the selection of the initial center points, the choice of the value of k, and isolated points, and it easily falls into local optima. Aiming at these shortcomings of k-means, Wang Xiufen, Zhang Junwei, Lu Jing et al. proposed selecting the initial center points by the max-min distance method; Li Youming, Xu Huyin, Liu Yanli et al. proposed a density-based method to select the initial center points and eliminate the influence of isolated points; Wu Xiaorong and WangShunye proposed selecting the initial center points based on a Huffman tree; and the artificial fish-swarm algorithm proposed by HaiTao Yu and XiaoxuCheng compensates for the tendency of k-means to converge to local optima. However, for larger, higher-dimensional data sets, these algorithms still suffer from very high time cost and slow convergence.
Summary of the invention
The technical problem to be solved by the present invention, in view of the defects of the prior art, is to provide a k-means data processing method based on data density and a Huffman tree. Compared with the traditional k-means algorithm, this method optimizes the initial cluster centers and improves classification accuracy; compared with the method based on the Huffman tree alone, it greatly reduces the running time of the algorithm while preserving classification accuracy.
The technical solution adopted by the present invention to solve the technical problem is a k-means data processing method based on data density and a Huffman tree, comprising the following steps:
1) for a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi;
where Eps is a given radius;
2) calculate the average density Mav of the points Pi in the data set U, Mav = (1/n) Σ M(Pi);
where n is the number of objects in the data set;
3) put the points whose density is greater than Mav into a set V, and generate a Huffman tree in V: at each step merge the two closest points, representing the new merged object by the attribute-wise average of the attributes of the two merged objects, until only one point remains in V;
4) delete k-1 points from the generated Huffman tree in reverse order, obtaining k points;
5) take the k points obtained as the initial cluster centers;
6) calculate the distances from all points in the data set U to the k centers, and assign each point to its nearest cluster;
7) adjust the cluster centers, moving each center to the geometric center (i.e. mean) of its class;
8) repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
According to this scheme, Eps in step 1) is computed as follows:
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set.
According to this scheme, the value of k is determined as follows:
K = (1 + √n) / 2.
The beneficial effect produced by the present invention is as follows: the present invention proposes a k-means algorithm based on density and a Huffman tree which, compared with the traditional k-means algorithm, optimizes the initial cluster centers and improves classification accuracy; compared with the method based on the Huffman tree alone, it greatly reduces the running time of the algorithm while preserving classification accuracy.
Brief description of the drawings
The invention will be further described below in conjunction with the drawings and embodiments, in which:
Fig. 1 is the method flow diagram of the embodiment of the present invention;
Fig. 2 is the time loss comparison diagram of the embodiment of the present invention.
Detailed description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, and are not intended to limit it.
As shown in Figure 1, a k-means data processing method based on data density and a Huffman tree comprises the following steps:
1) for a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi;
where Eps is a given radius;
where α is an adjustment factor (generally taken as 1), k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set; the distance used is the Euclidean distance.
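As a minimal illustration of this density count (the Eps formula is taken as given here, since it depends on the adjustment factor α and the pairwise distances described above; the function name `density` is a hypothetical helper, not from the patent):

```python
import math

def density(points, eps):
    """For each point, count the points within Euclidean distance eps
    of it, as in step 1 of the method. Each point counts itself."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [sum(1 for q in points if dist(p, q) <= eps) for p in points]

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(density(pts, eps=0.5))  # the three nearby points see each other: [3, 3, 3, 1]
```

The brute-force O(n²) scan matches the definition directly; a spatial index would be needed only for large data sets.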
For a familiar data set, the value of k is chosen directly from experience; for an unfamiliar data set, k generally satisfies the inequality
1 ≤ K ≤ √n;
according to the proposed improvement of the k value, the following expression is used to choose k:
K = (1 + √n) / 2, where n is the number of objects in the data set.
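A minimal sketch of this choice of k, assuming the partly garbled expression in the source reads K = (1 + √n) / 2, i.e. the midpoint of the stated range 1 ≤ K ≤ √n:

```python
import math

def choose_k(n):
    """Pick k as the midpoint of the range 1 <= K <= sqrt(n),
    rounded to the nearest integer (assumed reconstruction of
    the formula K = (1 + sqrt(n)) / 2)."""
    return round((1 + math.sqrt(n)) / 2)

print(choose_k(150))  # iris, n=150: sqrt(150) ~ 12.25, so K = 7
print(choose_k(210))  # seeds, n=210: sqrt(210) ~ 14.49, so K = 8
```

Both test data sets in the example below actually have 3 classes, so this expression is only a starting heuristic when k is unknown.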
2) calculate the average density Mav of each point Pi in the data set U, Mav = (1/n) Σ M(Pi);
where n is the number of objects in the data set;
3) put the points whose density is greater than Mav into a set V, and generate a Huffman tree in V: at each step merge the two closest points, representing the new merged object by the attribute-wise average of the attributes of the two merged objects, until only one point remains in V;
4) delete k-1 points from the generated Huffman tree in reverse order, obtaining k points;
The Huffman tree is built bottom-up: at each step the two closest points are merged into one point, until only one point remains in the set. Then k-1 points are deleted top-down (in reverse order of creation), which yields k points. Equivalently, during construction of the Huffman tree the merging of the two closest points can simply be stopped once k points remain in the set. If k = 2, only k-1 = 1 point (the topmost point) needs to be deleted; together with what remains in the set, this gives the initial cluster centers for the next step.
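The merge-until-k-points reading described above can be sketched as follows (a brute-force illustration under the attribute-averaging rule of step 3, not the patent's own implementation; `initial_centers` is a hypothetical helper name):

```python
import math

def initial_centers(high_density_points, k):
    """Repeatedly merge the two closest points into their attribute-wise
    average, stopping when k points remain. This is equivalent to building
    the Huffman tree bottom-up and deleting k-1 merged nodes top-down."""
    pts = [list(p) for p in high_density_points]
    while len(pts) > k:
        # find the closest pair (O(m^2) scan; fine for a small set V)
        best = None
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                d = math.dist(pts[i], pts[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = [(a + b) / 2 for a, b in zip(pts[i], pts[j])]
        pts.pop(j)  # remove the larger index first so index i stays valid
        pts.pop(i)
        pts.append(merged)
    return pts

print(initial_centers([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))
```

With the four points above, the two tight pairs are each merged once, leaving the two averaged points as the initial centers.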
5) take the remaining k points as the initial cluster centers;
6) calculate the distances from all points in the data set U to the k centers, and assign each point to its nearest cluster;
7) adjust the cluster centers, moving each center to the geometric center (i.e. mean) of its class;
The geometric center, or centroid, of a class is the point obtained by averaging each attribute over all points in the class.
8) repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
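Steps 6) to 8) are the standard k-means iteration; a compact sketch under the Euclidean distance used above (function and variable names are illustrative only):

```python
import math

def kmeans(points, centers, max_iter=100):
    """Steps 6-8: assign each point to its nearest center, move each
    center to the geometric center (mean) of its cluster, and repeat
    until the centers stop moving."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # mean of each attribute column; keep an empty cluster's old center
        new = [[sum(col) / len(col) for col in zip(*cl)] if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:  # centers no longer move: converged
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(pts, [(0, 0), (10, 10)])
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

The contribution of the patent lies entirely in how the `centers` argument is seeded (steps 1 to 5); the loop itself is unchanged from classical k-means.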
Specific example:
The iris and seeds data sets from the UCI database were used for testing. The iris data set comprises 150 sample records from 3 species of flower, 50 per species, each sample having 4 attributes; the seeds data set comprises 210 sample records from 3 varieties of wheat seed, 70 per variety, each sample having 7 attributes. The experiments compare the classification accuracy and time consumption of the traditional k-means algorithm, the k-means algorithm based on the Huffman tree (H+K), and the proposed k-means algorithm based on density and the Huffman tree (M+H+K), where accuracy = number of correctly classified samples / total number of samples.
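The accuracy measure used in the experiments is simply the fraction of correctly classified samples (this sketch assumes cluster labels have already been aligned with the true class labels, which the patent does not detail):

```python
def accuracy(predicted, actual):
    """Classification accuracy as defined above: correctly classified
    samples divided by the total number of samples."""
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy([0, 0, 1, 2, 2], [0, 0, 1, 1, 2]))  # 4 of 5 correct: 0.8
```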
From Table 1, comparing the results of the three algorithms on the two data sets, the accuracy of the proposed algorithm is approximately equal to that of the Huffman-tree-based k-means algorithm, and both are higher than the accuracy of the traditional k-means algorithm, showing that the proposed k-means algorithm based on density and the Huffman tree outperforms the traditional k-means algorithm in accuracy.
Table 1
To compare the time consumption of the Huffman-tree-based k-means algorithm and the proposed density-and-Huffman-tree k-means algorithm, data sets of different dimensions were selected from the UCI database for testing; the experimental results are shown in Fig. 2.
From Fig. 2 it can be seen that, for data sets of different dimensions, the time consumption of the proposed algorithm is consistently lower than that of the Huffman-tree-based k-means algorithm, showing that the algorithm proposed by this invention outperforms the Huffman-tree-based k-means algorithm in time consumption.
It should be understood that those of ordinary skill in the art can make improvements or modifications in accordance with the above description, and all such improvements and modifications shall fall within the scope of protection of the claims appended to the present invention.

Claims (3)

1. A k-means data processing method based on data density and a Huffman tree, comprising the following steps:
1) for a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi;
where Eps is a given radius;
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set;
2) calculate the average density Mav of each point Pi in the data set U;
where n is the number of objects in the data set;
3) put the points whose density is greater than Mav into a set V, and generate a Huffman tree in V: at each step merge the two closest points and replace them by their average, until only one point remains in V;
4) delete k-1 points from the generated Huffman tree in reverse order, obtaining k points;
5) take the k points obtained as the initial cluster centers;
6) calculate the distances from all points in the data set U to the k centers, and assign each point to its nearest cluster;
7) adjust the cluster centers, moving each center to the geometric center (i.e. mean) of its class;
8) repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
2. The k-means data processing method according to claim 1, characterized in that Eps in step 1) is computed as follows:
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set.
3. The k-means data processing method according to claim 1, characterized in that the value of k is determined as follows:
K = (1 + √n) / 2.
CN201510184419.1A 2015-04-17 2015-04-17 K-means data processing method based on data density and Huffman tree Pending CN104731760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510184419.1A CN104731760A (en) 2015-04-17 2015-04-17 K-means data processing method based on data density and Huffman tree


Publications (1)

Publication Number Publication Date
CN104731760A true CN104731760A (en) 2015-06-24

Family

ID=53455659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510184419.1A Pending CN104731760A (en) 2015-04-17 2015-04-17 K-means data processing method based on data density and Huffman tree

Country Status (1)

Country Link
CN (1) CN104731760A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204776A (en) * 2016-03-18 2017-09-26 余海箭 Web3D data compression algorithm based on floating-point number characteristics


Similar Documents

Publication Publication Date Title
WO2018219163A1 (en) Mapreduce-based distributed cluster processing method for large-scale data
CN108765371A Method for segmenting abnormal cells in pathological sections
WO2019136929A1 (en) Data clustering method and device based on k neighborhood similarity as well as storage medium
CN109168177B (en) Longitude and latitude backfill method based on soft mining signaling
CN104933156A (en) Collaborative filtering method based on shared neighbor clustering
US20170236292A1 (en) Method and device for image segmentation
CN101286199A Image segmentation method based on region growing and ant colony clustering
WO2015197029A1 (en) Human face similarity recognition method and system
CN103631928A (en) LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN103761311A (en) Sentiment classification method based on multi-source field instance migration
CN104240251A (en) Multi-scale point cloud noise detection method based on density analysis
CN108469263B (en) Method and system for shape point optimization based on curvature
CN108154158B (en) Building image segmentation method for augmented reality application
CN103886318A (en) Method for extracting and analyzing nidus areas in pneumoconiosis gross imaging
CN107918772A (en) Method for tracking target based on compressive sensing theory and gcForest
CN106776859A Mobile App recommendation system based on user preference
CN104992454A (en) Regionalized automatic-cluster-change image segmentation method
CN109271427A Clustering method based on neighbor density and manifold distance
CN109947940A (en) File classification method, device, terminal and storage medium
Atabay et al. A clustering algorithm based on integration of K-Means and PSO
CN103714168B Method and device for obtaining entries on intelligent electronic equipment with a touch screen
CN107680099A Image segmentation method fusing IFOA and F-ISODATA
CN109858612A Adaptive deformable dilated convolution method
KR20220034083A (en) Method and apparatus of generating font database, and method and apparatus of training neural network model, electronic device, recording medium and computer program
CN106952267B (en) Three-dimensional model set co-segmentation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150624
