CN107562948A - A distance-based parameter-free multidimensional data clustering method - Google Patents
- Publication number
- CN107562948A (application number CN201710884448.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- distance
- distance value
- data item
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data analysis, and in particular to a distance-based parameter-free multidimensional data clustering method comprising the following steps. Step 1: randomly select a data item x from the multidimensional data set D. Step 2: compute the distance between the data item x of Step 1 and every other data item in D. Step 3: for each data item, sum all the distance values computed in Step 2, then compute the sum of squared differences between each distance sum and the mean of the distance sums. By iteratively analyzing the distance sums, the method avoids the problem of how to choose an appropriate threshold H that parameterized multidimensional data clustering methods face, and so makes multidimensional data clustering less difficult.
Description
【Technical field】
The invention relates to the technical field of data analysis, and in particular to a distance-based parameter-free multidimensional data clustering method.
【Background】
Clustering is an important technique for analyzing multidimensional data. Cluster analysis is the process of grouping a set of physical or abstract objects into multiple classes, each made up of similar objects. It is an important human activity.
The goal of cluster analysis is to collect data and classify it on the basis of similarity. Clustering draws on many fields, including mathematics, computer science, statistics, biology and economics. Many clustering techniques have been developed in different application areas; they are used to describe data, to measure the similarity between different data sources, and to assign data sources to different clusters.
Clustering assigns data items with similar features in multidimensional data to the same class. The difference in features between data items is usually characterized by the distance $D_{x,y}$ between multidimensional data items x and y:
$$D_{x,y} = \sqrt{\sum_{i=1}^{L} (x_i - y_i)^2}$$
where L is the number of dimensions of the multidimensional data, and $x_i$ and $y_i$ are the values of data items x and y in the i-th dimension.
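As a minimal sketch, assuming the distance is the ordinary Euclidean distance (an assumption that reproduces the distance values quoted in the worked example of this document), $D_{x,y}$ can be computed as follows. The function name `euclidean_distance` is illustrative and does not come from the patent.

```python
import math


def euclidean_distance(x, y):
    """Distance D_{x,y} between two data items of the same dimension L."""
    if len(x) != len(y):
        raise ValueError("data items must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two four-dimensional data items taken from the worked example.
print(round(euclidean_distance((2, 2, 2, 4), (3, 2, 3, 4)), 6))  # 1.414214
```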
A traditional distance-based multidimensional data clustering method requires a threshold parameter H to be set, and stipulates that the distance between data items in the same class must not exceed H. For such parameterized methods, the question of how to choose an appropriate threshold H makes multidimensional data clustering more difficult.
【Summary of the invention】
In view of the defects and deficiencies of the prior art, the object of the invention is to provide a distance-based parameter-free multidimensional data clustering method. By iteratively analyzing the sums of distance values, it avoids the problem of how to choose an appropriate threshold H that parameterized multidimensional data clustering methods face, and so makes multidimensional data clustering less difficult.
The distance-based parameter-free multidimensional data clustering method of the invention uses the following steps:
Step 1: Randomly select a data item x from the multidimensional data set D.
Step 2: Compute the distance between the data item x of Step 1 and every other data item in D.
Step 3: For each data item, sum all the distance values computed in Step 2; then compute the sum of squared differences between each distance sum and the mean of the distance sums.
Step 4: If the sum of squares from Step 3 is smaller than the sum of squares computed in the previous iteration, sort the data set D in ascending order of distance sum and record the sorted result; then select the data item with the largest distance sum as the x of the next round of clustering, and repeat Steps 2 to 4.
Step 5: If the sum of squares is not smaller than the sum of squares computed in the previous iteration, stop sorting.
Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.
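Steps 1 to 5 can be sketched in Python as follows. This is an illustrative reading of the text, not the patent's reference implementation: the Euclidean distance, the function names `distance` and `record_orderings`, the `first_index`, `seed` and `max_rounds` parameters, the rule that the first round is always accepted, and the order in which the six example items are listed are all assumptions. Step 6 is not covered.

```python
import math
import random


def distance(x, y):
    """Euclidean distance between two data items (an assumption, see lead-in)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def record_orderings(data, first_index=None, seed=None, max_rounds=1000):
    """Run Steps 1-5; return one (index of x, ascending ordering) pair per accepted round."""
    rng = random.Random(seed)
    # Step 1: pick the starting item at random unless the caller fixes it.
    current = rng.randrange(len(data)) if first_index is None else first_index
    sums = [0.0] * len(data)   # running distance sum for every data item
    previous = math.inf        # assumption: the first round is always accepted
    recorded = []
    for _ in range(max_rounds):
        x = data[current]
        # Step 2: distance from x to every item (x itself contributes 0).
        dists = [distance(x, item) for item in data]
        # Step 3: accumulate per-item sums, then measure their spread.
        sums = [s + d for s, d in zip(sums, dists)]
        mean = sum(sums) / len(sums)
        spread = sum((s - mean) ** 2 for s in sums)
        # Step 5: stop as soon as the spread no longer decreases.
        if spread >= previous:
            break
        # Step 4: record the ascending ordering; its last item becomes the next x.
        order = sorted(range(len(data)), key=lambda i: sums[i])
        recorded.append((current, order))
        current = order[-1]
        previous = spread
    return recorded


# The six items of the worked example, in an assumed order with (2, 2, 2, 4) third.
D = [(10, 9, 7, 8), (4, 16, 3, 12), (2, 2, 2, 4),
     (3, 2, 3, 4), (8, 11, 8, 9), (4, 15, 5, 13)]
rounds = record_orderings(D, first_index=2)
print([D[i] for i, _ in rounds])
# [(2, 2, 2, 4), (4, 16, 3, 12), (10, 9, 7, 8)]
```

Keeping a running sum per data item means each round only needs the distances to the current x, so a round costs one pass over the data set plus one sort.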
With the above scheme, the invention has the following beneficial effect: by iteratively analyzing the sums of distance values, the distance-based parameter-free multidimensional data clustering method of the invention avoids the problem of how to choose an appropriate threshold H that parameterized multidimensional data clustering methods face, and so makes multidimensional data clustering less difficult.
【Detailed description】
The invention is described in detail below with reference to a specific embodiment. The illustrative example and its explanation serve only to explain the invention and do not limit it.
The distance-based parameter-free multidimensional data clustering method of this embodiment uses the following steps:
Step 1: Randomly select a data item x from the multidimensional data set D.
Step 2: Compute the distance between the data item x of Step 1 and every other data item in D.
Step 3: For each data item, sum all the distance values computed in Step 2; then compute the sum of squared differences between each distance sum and the mean of the distance sums.
Step 4: If the sum of squares from Step 3 is smaller than the sum of squares computed in the previous iteration, sort the data set D in ascending order of distance sum and record the sorted result; then select the data item with the largest distance sum as the x of the next round of clustering, and repeat Steps 2 to 4.
Step 5: If the sum of squares is not smaller than the sum of squares computed in the previous iteration, stop sorting.
Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.
The invention is now illustrated with a specific example:
Step 1: Randomly select the 3rd data item x = (2, 2, 2, 4) from the multidimensional data set D (Table 1), and compute the distance between x and every other data item in D.
Table 1:
Then sum the computed distance values for each data item.
Because this is the first time distance values are computed, each distance sum is simply the distance value itself. Compute the sum of squared differences between each distance sum and the mean of the distance sums:
$$(12.40967-9.9436)^2+(16.27882-9.9436)^2+(0-9.9436)^2+(1.414214-9.9436)^2+(13.34166-9.9436)^2+(16.21727-9.9436)^2=268.7479$$
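The arithmetic of this first round can be checked with a few lines of Python. The helper name `sum_squared_deviation` is illustrative; the six values are the distance sums quoted in the expression above.

```python
def sum_squared_deviation(values):
    """Sum of squared differences between each value and the mean of all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# Distance sums after the first round, as quoted in the text.
first_sums = [12.40967, 16.27882, 0, 1.414214, 13.34166, 16.21727]
print(round(sum(first_sums) / len(first_sums), 4))  # 9.9436
print(round(sum_squared_deviation(first_sums), 2))  # 268.75
```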
Next, sort the data set D in ascending order of distance sum to obtain the following table (Table 2).
Table 2:
2 | 2 | 2 | 4 |
3 | 2 | 3 | 4 |
10 | 9 | 7 | 8 |
8 | 11 | 8 | 9 |
4 | 15 | 5 | 13 |
4 | 16 | 3 | 12 |
Step 2: Select the last data item x = (4, 16, 3, 12) in the sorted result of Step 1, and compute the distance between x and every other data item in D, as in the following table (Table 3).
Table 3:
Then sum the computed distance values for each data item, as in the following table (Table 4).
Table 4:
Compute the sum of squared differences between each distance sum and the mean of the distance sums:
$$(23.22632-19.0037)^2+(16.27882-19.0037)^2+(16.27882-19.0037)^2+(17.5697-19.0037)^2+(22.00191-19.0037)^2+(18.66676-19.0037)^2=43.83961$$
Since 43.83961 is smaller than the previous sum of squares, 268.7479, clustering can continue.
Sort the data set D again in ascending order of distance sum, as in the following table (Table 5).
Table 5:
4 | 16 | 3 | 12 |
2 | 2 | 2 | 4 |
3 | 2 | 3 | 4 |
4 | 15 | 5 | 13 |
8 | 11 | 8 | 9 |
10 | 9 | 7 | 8 |
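The ranking and selection performed between rounds can be sketched as follows. The pairing of each data item with its distance sum is an assumption inferred from the values quoted in the text, since the corresponding table is not reproduced here; the variable names are illustrative.

```python
# Data items paired with their distance sums after the second round.
# The pairing is an assumption, inferred from the values quoted in the text.
sums_by_item = [
    ((10, 9, 7, 8), 23.22632),
    ((4, 16, 3, 12), 16.27882),
    ((2, 2, 2, 4), 16.27882),
    ((3, 2, 3, 4), 17.5697),
    ((8, 11, 8, 9), 22.00191),
    ((4, 15, 5, 13), 18.66676),
]

# sorted() is stable, so items with equal sums keep their original relative order.
ranked = sorted(sums_by_item, key=lambda pair: pair[1])
next_x = ranked[-1][0]  # the item with the largest distance sum

for item, total in ranked:
    print(item, total)
print("next x:", next_x)  # next x: (10, 9, 7, 8)
```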
Step 3: Select the last data item x = (10, 9, 7, 8) in the sorted result of Step 2, and compute the distance between x and every other data item in D, as in the following table (Table 6).
Table 6:
Then sum the computed distance values for each data item, as in the following table (Table 7).
Table 7:
Compute the sum of squared differences between each distance sum and the mean of the distance sums:
$$(23.22632-26.9771)^2+(27.09547-26.9771)^2+(28.68849-26.9771)^2+(28.97145-26.9771)^2+(25.16419-26.9771)^2+(28.71664-26.9771)^2=27.30129$$
Since 27.30129 is smaller than the previous sum of squares, 43.83961, clustering can continue.
Sort the data set D in ascending order of distance sum, as in the following table (Table 8).
Table 8:
10 | 9 | 7 | 8 |
8 | 11 | 8 | 9 |
4 | 16 | 3 | 12 |
2 | 2 | 2 | 4 |
4 | 15 | 5 | 13 |
3 | 2 | 3 | 4 |
Step 4: Select the last data item x = (3, 2, 3, 4) in the sorted result of Step 3, and compute the distance between x and every other data item in D, as in the following table (Table 8).
Table 8:
Then sum the computed distance values for each data item, as in the following table (Table 9).
Table 9:
Compute the sum of squared differences between each distance sum and the mean of the distance sums:
$$(34.62807-36.54879)^2+(43.25096-36.54879)^2+(30.102704-36.54879)^2+(28.97145-36.54879)^2+(37.654192-36.54879)^2+(44.68536-36.54879)^2=215.0020$$
Since 215.0020 is larger than the previous sum of squares, 27.30129, clustering stops.
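The stopping test of this round can be reproduced from the six distance sums listed above. The following is a minimal sketch; the function name `should_stop` is illustrative, and `statistics.fmean` requires Python 3.8 or later.

```python
from statistics import fmean


def should_stop(distance_sums, previous_spread):
    """True when the new sum of squared deviations is not smaller than the previous one."""
    mean = fmean(distance_sums)
    spread = sum((s - mean) ** 2 for s in distance_sums)
    return spread >= previous_spread


# Distance sums after the fourth round, and the sum of squares from the third round.
fourth_round_sums = [34.62807, 43.25096, 30.102704, 28.97145, 37.654192, 44.68536]
print(should_stop(fourth_round_sums, 27.30129))  # True
```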
Step 5: For the intermediate results obtained in Steps 1 to 3 (Table 1, Table 3 and Table 6), eliminate data items in descending order of distance value. The eliminated data items are marked in bold italics in the following tables:
Table 1,
Table 3,
Table 6
After the 11th data item has been eliminated, no duplicate data item remains across the three tables, and elimination stops.
The data items that remain in each table form the clustering result, shown in the following tables (Table 9, Table 10, Table 11).
Table 9:
Table 10:
Table 11:
By iteratively analyzing the sums of distance values, the distance-based parameter-free multidimensional data clustering method of the invention avoids the problem of how to choose an appropriate threshold H that parameterized multidimensional data clustering methods face, and so makes multidimensional data clustering less difficult.
The above is only a preferred embodiment of the invention. All equivalent changes or modifications made according to the features and principles described within the scope of this patent application are included in the scope of this patent application.
Claims (1)
- 1. A distance-based parameter-free multidimensional data clustering method, characterized in that it uses the following steps: Step 1: randomly select a data item x from the multidimensional data set D; Step 2: compute the distance between the data item x of Step 1 and every other data item in D; Step 3: for each data item, sum all the distance values computed in Step 2, then compute the sum of squared differences between each distance sum and the mean of the distance sums; Step 4: if the sum of squares from Step 3 is smaller than the sum of squares computed in the previous iteration, sort the data set D in ascending order of distance sum and record the sorted result, then select the data item with the largest distance sum as the x of the next round of clustering and repeat Steps 2 to 4; Step 5: if the sum of squares is not smaller than the sum of squares computed in the previous iteration, stop sorting; Step 6: for the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences; during deletion, if a data item exists in only one sequence, skip the deletion of that data item.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710884448.8A CN107562948A (en) | 2017-09-26 | 2017-09-26 | A distance-based parameter-free multidimensional data clustering method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710884448.8A CN107562948A (en) | 2017-09-26 | 2017-09-26 | A distance-based parameter-free multidimensional data clustering method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107562948A (en) | 2018-01-09 |
Family
ID=60982853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710884448.8A Pending CN107562948A (en) | 2017-09-26 | 2017-09-26 | A distance-based parameter-free multidimensional data clustering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107562948A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344194A (en) * | 2018-09-20 | 2019-02-15 | 北京工商大学 | Pesticide residue high dimensional data visual analysis method and system based on subspace clustering |
WO2019136929A1 (en) * | 2018-01-13 | 2019-07-18 | 惠州学院 | Data clustering method and device based on k neighborhood similarity as well as storage medium |
CN110909067A (en) * | 2019-10-28 | 2020-03-24 | 中南大学 | Visual analysis system and method for ocean multidimensional data |
- 2017-09-26: CN application CN201710884448.8A filed (publication CN107562948A), status: Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019136929A1 (en) * | 2018-01-13 | 2019-07-18 | 惠州学院 | Data clustering method and device based on k neighborhood similarity as well as storage medium |
CN109344194A (en) * | 2018-09-20 | 2019-02-15 | 北京工商大学 | Pesticide residue high dimensional data visual analysis method and system based on subspace clustering |
CN109344194B (en) * | 2018-09-20 | 2021-09-28 | 北京工商大学 | Subspace clustering-based pesticide residue high-dimensional data visual analysis method and system |
CN110909067A (en) * | 2019-10-28 | 2020-03-24 | 中南大学 | Visual analysis system and method for ocean multidimensional data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Effectively clustering by finding density backbone based-on kNN | |
CN107527031B (en) | SSD-based indoor target detection method | |
CN107562948A (en) | A distance-based parameter-free multidimensional data clustering method | |
EP2880566B1 (en) | A method for pre-processing and processing query operation on multiple data chunk on vector enabled architecture | |
CN108846338A (en) | Polarization characteristic selection and classification method based on object-oriented random forest | |
CN102254033A (en) | Entropy weight-based global K-means clustering method | |
Nama et al. | Implementation of K-Means Technique in Data Mining to Cluster Researchers Google Scholar Profile | |
CN107563324B (en) | Hyperspectral image classification method and device of ultralimit learning machine with composite nuclear structure | |
US8661040B2 (en) | Grid-based data clustering method | |
CN109993070A (en) | A kind of pedestrian's recognition methods again based on global distance scale loss function | |
Rani | Visual analytics for comparing the impact of outliers in k-means and k-medoids algorithm | |
CN109145111B (en) | Multi-feature text data similarity calculation method based on machine learning | |
Bellatreche et al. | Dimension table driven approach to referential partition relational data warehouses | |
CN108664548B (en) | Network access behavior characteristic group dynamic mining method and system under degradation condition | |
Lucchese et al. | Query-level early exit for additive learning-to-rank ensembles | |
CN104794215A (en) | Fast recursive clustering method suitable for large-scale data | |
Rao et al. | Efficient iceberg query evaluation using set representation | |
Yang et al. | A dynamic balanced quadtree for real-time streaming data | |
Koumarelas et al. | Binary Theta-Joins using MapReduce: Efficiency Analysis and Improvements. | |
Lou et al. | Research on data query optimization based on SparkSQL and MongoDB | |
Jiang et al. | A hybrid clustering algorithm | |
CN101082925A (en) | Rough set property reduction method based on SQL language | |
CN104850594A (en) | Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data | |
Chen et al. | Spatial data partitioning based on the clustering of minimum distance criterion | |
Kellom et al. | Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180109 |