CN104731760A - K-means data processing method based on data density and Huffman tree - Google Patents
- Publication number: CN104731760A (application CN201510184419.1A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a k-means data processing method based on data density and a Huffman tree. The method comprises the steps of: 1) calculating the density of each point Pi in a given data set U; 2) calculating the average density Mav over the points Pi of the data set U; 3) putting the points whose density is greater than the average density Mav into a set V and generating a Huffman tree over V; 4) reverse-deleting k-1 points from the generated Huffman tree to obtain k points; 5) setting the remaining k points as the initial clustering centers; 6) calculating the distances from all points of the data set U to the k centers and assigning each point to its closest cluster; 7) adjusting each clustering center by moving it to the geometric center of its class; 8) repeating steps 6 and 7 until the clustering centers no longer move, yielding the clustering result. The method has the advantage that classification accuracy is guaranteed while the running time of the algorithm is greatly shortened.
Description
Technical field
The present invention relates to data processing methods, and in particular to a k-means data processing method based on data density and a Huffman tree.
Background technology
The k-means algorithm was first proposed by MacQueen in 1967. Its core idea is to divide the data set to be processed into k classes such that the similarity within each class is highest and the similarity between classes is lowest. Because the idea of the algorithm is clear, it converges quickly, and it scales well, it has been widely used. However, its shortcomings are equally obvious: the algorithm is sensitive to the selection of initial center points, to the value of k, and to isolated points, and it easily falls into local optima. To address these shortcomings of k-means, Wang Xiufen, Zhang Junwei, Lu Jing et al. proposed selecting initial center points by the max-min distance method; Li Youming, Xu Huyin, Liu Yanli et al. proposed a density-based method to select initial center points and eliminate the influence of isolated points; Wu Xiaorong and WangShunye proposed selecting initial center points based on a Huffman tree; and the artificial fish-swarm algorithm proposed by HaiTao Yu and XiaoxuCheng compensates for the tendency of k-means to converge to local optima. Nevertheless, for larger and higher-dimensional data sets, the time cost of the algorithm remains very large and convergence is slow.
Summary of the invention
The technical problem to be solved by the present invention, in view of the defects of the prior art, is to provide a k-means data processing method based on data density and a Huffman tree. Compared with the traditional k-means algorithm, this method optimizes the initial cluster centers and improves the accuracy of classification; compared with the method based on the Huffman tree alone, it greatly reduces the running time of the algorithm while maintaining classification accuracy.
The technical solution adopted by the present invention to solve the technical problem is a k-means data processing method based on data density and a Huffman tree, comprising the following steps:
1) For a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi, where Eps is a given radius;
2) Calculate the average density Mav of the points Pi in the data set U, where n is the number of objects in the data set;
3) Put the points whose density is greater than Mav into a set V and generate a Huffman tree over V: at each step merge the two closest points, representing the newly merged object by the attribute-wise average of the two objects that produced it, until only one point remains in V;
4) Reverse-delete k-1 points from the generated Huffman tree to obtain k points;
5) Set the k points obtained as the initial cluster centers;
6) Calculate the distance from every point in the data set U to each of the k centers, and assign each point to its nearest cluster;
7) Adjust the cluster centers: move each center to the geometric center (i.e. the mean) of its class;
8) Repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
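The density computation of steps 1) and 2) can be sketched in code. This is a minimal illustrative sketch, not the patent's implementation: it assumes the data set is a NumPy array, uses Euclidean distance (as the detailed description later specifies), and the toy data is invented for the example.

```python
import numpy as np

def point_densities(U, eps):
    """Step 1: the density M(Pi) of each point Pi is the number of objects
    of U (the point itself included) falling inside the circle of radius
    eps centered at Pi. Euclidean distance is assumed."""
    dists = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2))
    return (dists <= eps).sum(axis=1)

# Step 2: the average density M_av over the n points of the data set
U = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # toy data, not from the patent
m = point_densities(U, eps=0.5)
m_av = m.mean()
```

Points whose density exceeds `m_av` would then be collected into the set V of step 3).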
In this scheme, Eps in step 1) is computed as follows, where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set.
In this scheme, the value of k is determined as follows:
The beneficial effect produced by the present invention is a k-means algorithm based on density and a Huffman tree. Compared with the traditional k-means algorithm, it optimizes the initial cluster centers and improves the accuracy of classification; compared with the method based on the Huffman tree alone, it greatly reduces the running time of the algorithm while maintaining classification accuracy.
Brief description of the drawings
The invention is further described below in conjunction with the drawings and embodiments, in which:
Fig. 1 is the method flow diagram of the embodiment of the present invention;
Fig. 2 is the time loss comparison diagram of the embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further elaborated below in conjunction with embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
As shown in Fig. 1, a k-means data processing method based on data density and a Huffman tree comprises the following steps:
1) For a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi, where Eps is a given radius;
where α is an adjustment factor (generally 1), k is the number of classes into which the data set is divided, and d(x_i, x_j) is the Euclidean distance between any two points in the data set.
For a familiar data set, the value of k is chosen directly from experience; for an unfamiliar data set, k generally satisfies the following inequality:
Based on this constraint on k, the following expression is used to choose the value of k:
n is the number of objects in the data set;
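The text names the symbols in the Eps and k formulas (α, k, n, d(x_i, x_j)) but the formula images themselves are not reproduced here, so the exact expressions in the sketch below are assumptions, not the patent's: Eps is taken as the mean pairwise distance scaled by α/k, and k as the common rule of thumb k ≈ √(n/2).

```python
import numpy as np

def estimate_eps(U, k, alpha=1.0):
    # Assumed form: mean pairwise Euclidean distance d(x_i, x_j), scaled by
    # alpha / k. The patent's actual formula image is not reproduced in the text.
    n = len(U)
    dists = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2))
    mean_pairwise = dists.sum() / (n * (n - 1))  # self-distances are zero
    return alpha * mean_pairwise / k

def estimate_k(n):
    # Assumed stand-in for the unreproduced inequality on k: k ~ sqrt(n/2),
    # a common rule of thumb for choosing the number of clusters.
    return max(2, int(np.sqrt(n / 2) + 0.5))

eps_demo = estimate_eps(np.array([[0.0, 0.0], [1.0, 0.0]]), k=1)
k_demo = estimate_k(150)  # e.g. n = 150 as in the iris data set
```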
2) Calculate the average density Mav of each point Pi in the data set U, where n is the number of objects in the data set;
3) Put the points whose density is greater than Mav into a set V and generate a Huffman tree over V: at each step merge the two closest points, representing the newly merged object by the attribute-wise average of the two objects that produced it, until only one point remains in V;
4) Reverse-delete k-1 points from the generated Huffman tree to obtain k points;
The Huffman tree is built bottom-up: at each step the two closest points are merged into one point, until only one point remains in the set. Then k-1 points are deleted top-down (in reverse order), yielding k points. Equivalently, during the construction of the Huffman tree, the merging of the two closest points can simply stop when k points remain in the set. If k = 2, only k-1 = 1 point (the topmost one) needs to be reverse-deleted; together with the point remaining in the set, it serves as an initial cluster center for the next step.
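The merge-and-reverse-delete of steps 3) and 4) can be sketched using the equivalent formulation just described (stop merging when k points remain). This is an O(n³) illustrative sketch, assuming points are attribute vectors; the toy input is invented for the example.

```python
import numpy as np

def initial_centers(V, k):
    """Merge the two closest points of V into their attribute-wise average,
    repeatedly, until k points remain -- equivalent to building the full
    Huffman tree and reverse-deleting k-1 points."""
    points = [np.asarray(p, dtype=float) for p in V]
    while len(points) > k:
        # find the closest pair (brute force, for clarity)
        _, i, j = min(((np.linalg.norm(points[i] - points[j]), i, j)
                       for i in range(len(points))
                       for j in range(i + 1, len(points))), key=lambda t: t[0])
        merged = (points[i] + points[j]) / 2.0
        points = [p for t, p in enumerate(points) if t not in (i, j)]
        points.append(merged)
    return np.array(points)

# two well-separated toy pairs collapse to their midpoints
centers_demo = initial_centers([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```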
5) Set the k remaining points as the initial cluster centers;
6) Calculate the distance from every point in the data set U to each of the k centers, and assign each point to its nearest cluster;
7) Adjust the cluster centers: move each center to the geometric center (i.e. the mean) of its class;
The geometric center (centroid) of a class is the point obtained by averaging each attribute over all the points in the class.
8) Repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
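Steps 5) through 8) are the standard k-means iteration. A minimal sketch, assuming NumPy arrays, Euclidean distance, and toy data invented for the example:

```python
import numpy as np

def kmeans_iterate(U, centers, max_iter=100):
    """Assign every point of U to its nearest center (step 6), move each
    center to the geometric center, i.e. the mean, of its class (step 7),
    and repeat until the centers no longer move (step 8)."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([U[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

U_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers_demo, labels_demo = kmeans_iterate(U_demo, [[0.0, 0.0], [10.0, 10.0]])
```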
Concrete example:
The iris data set and the seeds data set from the UCI database are used for testing. The iris data set comprises 150 record samples of 3 kinds of flowers, 50 per kind, with 4 attributes per sample; the seeds data set comprises 210 record samples of 3 wheat seed varieties, 70 per variety, with 7 attributes per variety. The experiments compare the classification accuracy and time cost of the traditional k-means algorithm, the Huffman-tree-based k-means algorithm (H+K), and the proposed density- and Huffman-tree-based k-means algorithm (M+H+K), where accuracy = number of correctly classified samples / total number of samples.
Table 1 contrasts the results of the three algorithms on the two data sets: the accuracy of the proposed algorithm is approximately identical to that of the Huffman-tree-based k-means algorithm, and both are higher than the accuracy of the traditional k-means algorithm, showing that the proposed density- and Huffman-tree-based k-means algorithm is better than the traditional k-means algorithm in accuracy.
Table 1
To contrast the time cost of the Huffman-tree-based k-means algorithm and the proposed density- and Huffman-tree-based k-means algorithm, data sets of different dimensions from the UCI database were selected for testing; the experimental results are shown in Fig. 2.
Fig. 2 shows that, for data sets of different dimensions, the time cost of the proposed density- and Huffman-tree-based k-means algorithm is consistently lower than that of the Huffman-tree-based k-means algorithm, showing that the proposed algorithm is better than the Huffman-tree-based k-means algorithm in time cost.
It should be understood that those of ordinary skill in the art can make improvements or transformations according to the above description, and all such improvements and transformations should fall within the protection scope of the appended claims of the present invention.
Claims (3)
1. A k-means data processing method based on data density and a Huffman tree, comprising the following steps:
1) For a given data set U, calculate the density M(Pi) of each point Pi in U; the density of any point Pi is the number of objects contained within the circle of radius Eps centered at Pi, where Eps is a given radius;
where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set;
2) Calculate the average density Mav of each point Pi in the data set U, where n is the number of objects in the data set;
3) Put the points whose density is greater than Mav into a set V and generate a Huffman tree over V: at each step merge the two closest points, replacing them by their average, until only one point remains in V;
4) Reverse-delete k-1 points from the generated Huffman tree to obtain k points;
5) Set the k points obtained as the initial cluster centers;
6) Calculate the distance from every point in the data set U to each of the k centers, and assign each point to its nearest cluster;
7) Adjust the cluster centers: move each center to the geometric center (i.e. the mean) of its class;
8) Repeat steps 6) and 7) until the cluster centers no longer move, yielding the clustering result.
2. The k-means data processing method according to claim 1, characterized in that Eps in step 1) is computed as follows, where α is an adjustment factor, k is the number of classes into which the data set is divided, and d(x_i, x_j) denotes the distance between any two points in the data set.
3. The k-means data processing method according to claim 1, characterized in that the value of k is determined as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510184419.1A CN104731760A (en) | 2015-04-17 | 2015-04-17 | K-means data processing method based on data density and Huffman tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104731760A true CN104731760A (en) | 2015-06-24 |
Family
ID=53455659
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107204776A (en) * | 2016-03-18 | 2017-09-26 | 余海箭 | A kind of Web3D data compression algorithms based on floating number situation |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20150624