CN104731760A - K-means data processing method based on data density and Huffman tree - Google Patents


Info

Publication number
CN104731760A
CN104731760A
Authority
CN
China
Prior art keywords
point, data, points, density, data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510184419.1A
Other languages
Chinese (zh)
Inventor
邓燕妮
褚四勇
涂林丽
尉成勇
邓智斌
龚良文
赵东明
傅剑
刘小珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201510184419.1A priority Critical patent/CN104731760A/en
Publication of CN104731760A publication Critical patent/CN104731760A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a k-means data processing method based on data density and a Huffman tree. The method includes the steps of: 1, calculating the density of each point Pi in a given data set U; 2, calculating the average density Mav of the points Pi in the data set U; 3, putting the points whose density is greater than the average density Mav into a set V, and generating a Huffman tree from the set V; 4, deleting k-1 points from the generated Huffman tree in reverse order, to obtain k points; 5, taking the remaining k points as the initial clustering centers; 6, calculating the distances from all points of the data set U to the k centers, and assigning each point to the cluster closest to it; 7, adjusting each clustering center by moving it to the geometric center of its class; 8, repeating steps 6 and 7 until the clustering centers no longer move, thus acquiring the clustering result. The method has the advantage that classification accuracy is preserved while the running time of the algorithm is greatly shortened.

Description

K-means data processing method based on data density and Huffman tree
Technical field
The present invention relates to data processing methods, and in particular to a k-means data processing method based on data density and a Huffman tree.
Background technology
The k-means algorithm was first proposed by MacQueen in 1967. Its core idea is to divide the data set to be processed into k classes such that similarity within each class is highest and similarity between classes is lowest. Because the idea of the algorithm is clear, it converges quickly, and it scales well, it has been widely used; but its shortcomings are equally obvious: the algorithm is sensitive to the selection of the initial center points, the choice of the value of k, and isolated points, and it easily falls into local optima. Aiming at these shortcomings of k-means, Wang Xiufen, Zhang Junwei, Lu Jing et al. proposed selecting the initial center points by the max-min distance method; Li Youming, Xu Huyin, Liu Yanli et al. proposed a density-based method to select the initial center points and eliminate the influence of isolated points; Wu Xiaorong and WangShunye proposed selecting the initial center points based on a Huffman tree; and the artificial fish-swarm algorithm proposed by HaiTao Yu and XiaoxuCheng compensates for the tendency of k-means to converge to local optima. However, for larger, higher-dimensional data sets, these algorithms still suffer from very high time cost and slow convergence.
Summary of the invention
The technical problem to be solved by the present invention, in view of the defects of the prior art, is to provide a k-means data processing method based on data density and a Huffman tree. Compared with the traditional k-means algorithm, this method optimizes the initial cluster centers and improves classification accuracy; compared with the method based on the Huffman tree alone, it greatly reduces the running time of the algorithm while preserving classification accuracy.
The technical solution adopted by the present invention to solve the technical problem is a k-means data processing method based on data density and a Huffman tree, comprising the following steps:
1) for a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi;
where Eps is a given radius;
2) calculate the average density Mav of the points Pi in the data set U, Mav = (1/n) Σ M(Pi);
where n is the number of objects in the data set;
3) put the points whose density is greater than Mav into a set V, and generate a Huffman tree in V: at each step merge the two closest points, representing the new merged object by the attribute-wise average of the attributes of the two merged objects, until only one point remains in V;
4) delete k-1 points from the generated Huffman tree in reverse order, obtaining k points;
5) take the k points obtained as the initial cluster centers;
6) calculate the distances from all points in the data set U to the k centers, and assign each point to its nearest cluster;
7) adjust the cluster centers, moving each center to the geometric center (i.e. mean) of its class;
8) repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
According to this scheme, Eps in step 1) is computed as follows:
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set.
According to this scheme, the value of k is determined as follows:
K = (1 + √n) / 2.
The beneficial effect produced by the present invention is as follows: the present invention proposes a k-means algorithm based on density and a Huffman tree which, compared with the traditional k-means algorithm, optimizes the initial cluster centers and improves classification accuracy; compared with the method based on the Huffman tree alone, it greatly reduces the running time of the algorithm while preserving classification accuracy.
Brief description of the drawings
The invention will be further described below in conjunction with the drawings and embodiments, in which:
Fig. 1 is the method flow diagram of the embodiment of the present invention;
Fig. 2 is the time loss comparison diagram of the embodiment of the present invention.
Detailed description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, and are not intended to limit it.
As shown in Figure 1, a k-means data processing method based on data density and a Huffman tree comprises the following steps:
1) for a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi;
where Eps is a given radius;
where α is an adjustment factor (generally taken as 1), k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set; the distance used is the Euclidean distance.
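As a minimal illustration of this density count (the Eps formula is taken as given here, since it depends on the adjustment factor α and the pairwise distances described above; the function name `density` is a hypothetical helper, not from the patent):

```python
import math

def density(points, eps):
    """For each point, count the points within Euclidean distance eps
    of it, as in step 1 of the method. Each point counts itself."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [sum(1 for q in points if dist(p, q) <= eps) for p in points]

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(density(pts, eps=0.5))  # the three nearby points see each other: [3, 3, 3, 1]
```

The brute-force O(n²) scan matches the definition directly; a spatial index would be needed only for large data sets.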
For a familiar data set, the value of k is chosen directly from experience; for an unfamiliar data set, k generally satisfies the inequality
1 ≤ K ≤ √n;
according to the proposed improvement of the k value, the following expression is used to choose k:
K = (1 + √n) / 2, where n is the number of objects in the data set.
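A minimal sketch of this choice of k, assuming the partly garbled expression in the source reads K = (1 + √n) / 2, i.e. the midpoint of the stated range 1 ≤ K ≤ √n:

```python
import math

def choose_k(n):
    """Pick k as the midpoint of the range 1 <= K <= sqrt(n),
    rounded to the nearest integer (assumed reconstruction of
    the formula K = (1 + sqrt(n)) / 2)."""
    return round((1 + math.sqrt(n)) / 2)

print(choose_k(150))  # iris, n=150: sqrt(150) ~ 12.25, so K = 7
print(choose_k(210))  # seeds, n=210: sqrt(210) ~ 14.49, so K = 8
```

Both test data sets in the example below actually have 3 classes, so this expression is only a starting heuristic when k is unknown.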
2) calculate the average density Mav of each point Pi in the data set U, Mav = (1/n) Σ M(Pi);
where n is the number of objects in the data set;
3) put the points whose density is greater than Mav into a set V, and generate a Huffman tree in V: at each step merge the two closest points, representing the new merged object by the attribute-wise average of the attributes of the two merged objects, until only one point remains in V;
4) delete k-1 points from the generated Huffman tree in reverse order, obtaining k points;
The Huffman tree is built bottom-up: at each step the two closest points are merged into one point, until only one point remains in the set. Then k-1 points are deleted top-down (in reverse order of creation), which yields k points. Equivalently, during construction of the Huffman tree the merging of the two closest points can simply be stopped once k points remain in the set. If k = 2, only k-1 = 1 point (the topmost point) needs to be deleted; together with what remains in the set, this gives the initial cluster centers for the next step.
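The merge-until-k-points reading described above can be sketched as follows (a brute-force illustration under the attribute-averaging rule of step 3, not the patent's own implementation; `initial_centers` is a hypothetical helper name):

```python
import math

def initial_centers(high_density_points, k):
    """Repeatedly merge the two closest points into their attribute-wise
    average, stopping when k points remain. This is equivalent to building
    the Huffman tree bottom-up and deleting k-1 merged nodes top-down."""
    pts = [list(p) for p in high_density_points]
    while len(pts) > k:
        # find the closest pair (O(m^2) scan; fine for a small set V)
        best = None
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                d = math.dist(pts[i], pts[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = [(a + b) / 2 for a, b in zip(pts[i], pts[j])]
        pts.pop(j)  # remove the larger index first so index i stays valid
        pts.pop(i)
        pts.append(merged)
    return pts

print(initial_centers([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))
```

With the four points above, the two tight pairs are each merged once, leaving the two averaged points as the initial centers.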
5) take the remaining k points as the initial cluster centers;
6) calculate the distances from all points in the data set U to the k centers, and assign each point to its nearest cluster;
7) adjust the cluster centers, moving each center to the geometric center (i.e. mean) of its class;
The geometric center, or centroid, of a class is the point obtained by averaging each attribute over all points in the class.
8) repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
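Steps 6) to 8) are the standard k-means iteration; a compact sketch under the Euclidean distance used above (function and variable names are illustrative only):

```python
import math

def kmeans(points, centers, max_iter=100):
    """Steps 6-8: assign each point to its nearest center, move each
    center to the geometric center (mean) of its cluster, and repeat
    until the centers stop moving."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # mean of each attribute column; keep an empty cluster's old center
        new = [[sum(col) / len(col) for col in zip(*cl)] if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:  # centers no longer move: converged
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(pts, [(0, 0), (10, 10)])
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

The contribution of the patent lies entirely in how the `centers` argument is seeded (steps 1 to 5); the loop itself is unchanged from classical k-means.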
Specific example:
The iris and seeds data sets from the UCI database were used for testing. The iris data set comprises 150 sample records from 3 species of flower, 50 per species, each sample having 4 attributes; the seeds data set comprises 210 sample records from 3 varieties of wheat seed, 70 per variety, each sample having 7 attributes. The experiments compare the classification accuracy and time consumption of the traditional k-means algorithm, the k-means algorithm based on the Huffman tree (H+K), and the proposed k-means algorithm based on density and the Huffman tree (M+H+K), where accuracy = number of correctly classified samples / total number of samples.
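The accuracy measure used in the experiments is simply the fraction of correctly classified samples (this sketch assumes cluster labels have already been aligned with the true class labels, which the patent does not detail):

```python
def accuracy(predicted, actual):
    """Classification accuracy as defined above: correctly classified
    samples divided by the total number of samples."""
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy([0, 0, 1, 2, 2], [0, 0, 1, 1, 2]))  # 4 of 5 correct: 0.8
```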
From Table 1, comparing the results of the three algorithms on the two data sets, the accuracy of the proposed algorithm is approximately equal to that of the Huffman-tree-based k-means algorithm, and both are higher than the accuracy of the traditional k-means algorithm, showing that the proposed k-means algorithm based on density and the Huffman tree outperforms the traditional k-means algorithm in accuracy.
Table 1
To compare the time consumption of the Huffman-tree-based k-means algorithm and the proposed density-and-Huffman-tree k-means algorithm, data sets of different dimensions were selected from the UCI database for testing; the experimental results are shown in Fig. 2.
From Fig. 2 it can be seen that, for data sets of different dimensions, the time consumption of the proposed algorithm is consistently lower than that of the Huffman-tree-based k-means algorithm, showing that the algorithm proposed by this invention outperforms the Huffman-tree-based k-means algorithm in time consumption.
It should be understood that those of ordinary skill in the art can make improvements or modifications in accordance with the above description, and all such improvements and modifications shall fall within the scope of protection of the claims appended to the present invention.

Claims (3)

1. A k-means data processing method based on data density and a Huffman tree, comprising the following steps:
1) for a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi;
where Eps is a given radius;
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set;
2) calculate the average density Mav of each point Pi in the data set U;
where n is the number of objects in the data set;
3) put the points whose density is greater than Mav into a set V, and generate a Huffman tree in V: at each step merge the two closest points and replace them by their average, until only one point remains in V;
4) delete k-1 points from the generated Huffman tree in reverse order, obtaining k points;
5) take the k points obtained as the initial cluster centers;
6) calculate the distances from all points in the data set U to the k centers, and assign each point to its nearest cluster;
7) adjust the cluster centers, moving each center to the geometric center (i.e. mean) of its class;
8) repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
2. The k-means data processing method according to claim 1, characterized in that Eps in step 1) is computed as follows:
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set.
3. The k-means data processing method according to claim 1, characterized in that the value of k is determined as follows:
K = (1 + √n) / 2.
CN201510184419.1A 2015-04-17 2015-04-17 K-means data processing method based on data density and Huffman tree Pending CN104731760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510184419.1A CN104731760A (en) 2015-04-17 2015-04-17 K-means data processing method based on data density and Huffman tree


Publications (1)

Publication Number Publication Date
CN104731760A true CN104731760A (en) 2015-06-24

Family

ID=53455659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510184419.1A Pending CN104731760A (en) 2015-04-17 2015-04-17 K-means data processing method based on data density and Huffman tree

Country Status (1)

Country Link
CN (1) CN104731760A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204776A (en) * 2016-03-18 2017-09-26 余海箭 Web3D data compression algorithm based on floating-point number characteristics


Similar Documents

Publication Publication Date Title
WO2018219163A1 (en) Mapreduce-based distributed cluster processing method for large-scale data
CN108765371A Method for segmenting abnormal cells in pathological sections
WO2019136929A1 (en) Data clustering method and device based on k neighborhood similarity as well as storage medium
CN109168177B (en) Longitude and latitude backfill method based on soft mining signaling
CN104933156A (en) Collaborative filtering method based on shared neighbor clustering
US20170236292A1 (en) Method and device for image segmentation
CN101286199A Image segmentation method based on region growing and ant colony clustering
WO2015197029A1 (en) Human face similarity recognition method and system
CN103631928A (en) LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN103761311A (en) Sentiment classification method based on multi-source field instance migration
CN104240251A (en) Multi-scale point cloud noise detection method based on density analysis
CN108469263B (en) Method and system for shape point optimization based on curvature
CN108154158B (en) Building image segmentation method for augmented reality application
CN103886318A (en) Method for extracting and analyzing nidus areas in pneumoconiosis gross imaging
CN107918772A (en) Method for tracking target based on compressive sensing theory and gcForest
CN106776859A Mobile App recommendation system based on user preference
CN104992454A (en) Regionalized automatic-cluster-change image segmentation method
CN109271427A Clustering method based on neighbor density and manifold distance
CN109947940A (en) File classification method, device, terminal and storage medium
Atabay et al. A clustering algorithm based on integration of K-Means and PSO
CN103714168B Method and device for obtaining entries on intelligent electronic equipment with a touch screen
CN107680099A Image segmentation method fusing IFOA and F-ISODATA
CN109858612A Adaptive deformable dilated convolution method
KR20220034083A (en) Method and apparatus of generating font database, and method and apparatus of training neural network model, electronic device, recording medium and computer program
CN106952267B (en) Three-dimensional model set co-segmentation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150624
