CN107562948A - A distance-based parameter-free multidimensional data clustering method - Google Patents

Publication number
CN107562948A
Authority
CN
China
Prior art keywords
data
distance
distance value
data item
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710884448.8A
Other languages
Chinese (zh)
Inventor
莫毓昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201710884448.8A
Publication of CN107562948A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data analysis, and in particular to a distance-based parameter-free multidimensional data clustering method comprising the following steps. Step 1: randomly select a data item x from the multidimensional data set D. Step 2: calculate the distance between the data item x selected in Step 1 and each of the other data items in D. Step 3: sum, separately for each data item, the distance values calculated in all executions of Step 2, then calculate the sum of squared differences between each distance-value sum and their mean. By iteratively analysing the distance-value sums, the method avoids the problem, faced by conventional parameterized multidimensional data clustering methods, of choosing an appropriate threshold H, and so makes multidimensional data clustering less difficult.

Description

A distance-based parameter-free multidimensional data clustering method
【Technical Field】
The present invention relates to the technical field of data analysis, and in particular to a distance-based parameter-free multidimensional data clustering method.
【Background Art】
Clustering is a very important technique for analysing multidimensional data. Cluster analysis is the process of grouping a set of physical or abstract objects into multiple classes, each made up of similar objects. It is an important human activity.
The goal of cluster analysis is to classify collected data on the basis of similarity. Clustering draws on many fields, including mathematics, computer science, statistics, biology and economics. Many clustering techniques have been developed in different application areas. They are used to describe data, to measure the similarity between different data sources, and to assign data sources to different clusters.
Clustering places data items with similar features in the same class. The difference in features between data items is usually characterized by the distance D_{x,y} between multidimensional data items x and y:
$$D_{x,y} = \sqrt{\sum_{i=1}^{L} (y_i - x_i)^2}$$
where L is the number of dimensions of the multidimensional data, and x_i and y_i are the values of the i-th dimension of data items x and y respectively.
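As a minimal sketch, assuming the distance is the ordinary Euclidean distance (an assumption that agrees with the distance values quoted in the embodiment), the measure can be computed as follows. The function name `distance` is illustrative and does not come from the patent.

```python
import math

def distance(x, y):
    """Euclidean distance between two data items of equal dimension L."""
    if len(x) != len(y):
        raise ValueError("data items must have the same number of dimensions")
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

# Two data items taken from the embodiment's data set
print(round(distance((2, 2, 2, 4), (3, 2, 3, 4)), 6))  # 1.414214
```

The printed value matches the 1.414214 that the embodiment uses for this pair of items.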
Conventional distance-based multidimensional data clustering methods require a threshold parameter H to be set, with the convention that the distance between data items in the same class does not exceed H. For such parameterized methods, the need to choose an appropriate threshold H makes multidimensional data clustering more difficult.
【Summary of the Invention】
In view of the defects and deficiencies of the prior art, the present invention aims to provide a distance-based parameter-free multidimensional data clustering method. By iteratively analysing the distance-value sums, it avoids the problem, faced by conventional parameterized multidimensional data clustering methods, of choosing an appropriate threshold H, and so makes multidimensional data clustering less difficult.
The distance-based parameter-free multidimensional data clustering method of the present invention uses the following steps:
Step 1: Randomly select a data item x from the multidimensional data set D;
Step 2: Calculate the distance between the data item x selected in Step 1 and each of the other data items in D;
Step 3: Sum, separately for each data item, the distance values calculated in all executions of Step 2; then calculate the sum of squared differences between each distance-value sum and their mean;
Step 4: If the sum of squares from Step 3 is smaller than the previously calculated sum of squares, sort the data set D in ascending order of distance-value sum and record the sorting result; then select the data item with the largest distance-value sum as the x for the next round of clustering, and re-execute Steps 2 to 4;
Step 5: If the sum of squares is not smaller than the previously calculated sum of squares, stop sorting;
Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.
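Steps 1 to 5 can be sketched as a loop. This is one reading of the steps, not the patent's own implementation. It assumes Euclidean distance, treats the first round as always accepted (there is no earlier sum of squares to compare with), and keeps each accepted round's distances for use in Step 6, following the embodiment. The function name and the `start` and `seed` parameters are hypothetical.

```python
import math
import random

def iterate_distance_sums(data, start=None, seed=None):
    """Steps 1-5 (one reading): return (index of x, distances to x) per accepted round."""
    rng = random.Random(seed)
    x = rng.randrange(len(data)) if start is None else start   # Step 1
    sums = [0.0] * len(data)     # running distance sums, one per data item
    previous = math.inf          # assumption: the first round is always accepted
    rounds = []
    while True:
        dists = [math.dist(data[x], item) for item in data]    # Step 2
        sums = [s + d for s, d in zip(sums, dists)]             # Step 3: accumulate
        mean = sum(sums) / len(sums)
        squares = sum((s - mean) ** 2 for s in sums)
        if not squares < previous:                               # Step 5: stop
            return rounds
        previous = squares
        rounds.append((x, dists))
        # Step 4: stable ascending sort by distance sum; the last item is the next x
        order = sorted(range(len(data)), key=lambda i: sums[i])
        x = order[-1]
```

Because the sum of squares must strictly decrease for the loop to continue, the loop always terminates. Run on the six data items of the embodiment with the third item as the starting x, this sketch accepts three rounds and stops on the fourth, as the embodiment does.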
With the above method, the present invention has the following beneficial effect: by iteratively analysing the distance-value sums, the distance-based parameter-free multidimensional data clustering method avoids the problem, faced by conventional parameterized multidimensional data clustering methods, of choosing an appropriate threshold H, and so makes multidimensional data clustering less difficult.
【Detailed Description】
The present invention is described in detail below with reference to a specific embodiment. The illustrative example and its explanation serve only to explain the invention and do not limit it.
The distance-based parameter-free multidimensional data clustering method of this embodiment uses the following steps:
Step 1: Randomly select a data item x from the multidimensional data set D;
Step 2: Calculate the distance between the data item x selected in Step 1 and each of the other data items in D;
Step 3: Sum, separately for each data item, the distance values calculated in all executions of Step 2; then calculate the sum of squared differences between each distance-value sum and their mean;
Step 4: If the sum of squares from Step 3 is smaller than the previously calculated sum of squares, sort the data set D in ascending order of distance-value sum and record the sorting result; then select the data item with the largest distance-value sum as the x for the next round of clustering, and re-execute Steps 2 to 4;
Step 5: If the sum of squares is not smaller than the previously calculated sum of squares, stop sorting;
Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.
The invention is now illustrated with a specific example:
Step 1: The third data item, x=(2,2,2,4), is selected at random from the data set D (Table 1), and the distances between x and the other data items in D are calculated;
Table 1:
The calculated distance values are then summed separately for each data item.
Because this is the first time the distance values are calculated, each distance-value sum is simply the distance itself. The sum of squared differences between each distance-value sum and their mean is:
(12.40967 - 9.9436)^2 + (16.27882 - 9.9436)^2 + (0 - 9.9436)^2 + (1.414214 - 9.9436)^2 + (13.34166 - 9.9436)^2 + (16.21727 - 9.9436)^2 = 268.7479.
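The arithmetic of this first round can be checked with a few lines of Python. The six values are the distance-value sums quoted in the expression above. The variable names are illustrative.

```python
# Distance-value sums quoted for the first round of the example
sums = [12.40967, 16.27882, 0.0, 1.414214, 13.34166, 16.21727]

mean = sum(sums) / len(sums)
squares = sum((s - mean) ** 2 for s in sums)

print(round(mean, 4))     # 9.9436
print(round(squares, 2))  # 268.75
```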
The data set D is then sorted in ascending order of distance-value sum, giving the following table (Table 2).
Table 2:
2 2 2 4
3 2 3 4
10 9 7 8
8 11 8 9
4 15 5 13
4 16 3 12
Step 2: The last data item in the sorting result of Step 1, x=(4,16,3,12), is selected, and the distances between x and the other data items in D are calculated, as shown in the following table (Table 3).
Table 3:
The calculated distance values are then summed separately for each data item, as shown in the following table (Table 4).
Table 4:
The sum of squared differences between each distance-value sum and their mean is:
(23.22632 - 19.0037)^2 + (16.27882 - 19.0037)^2 + (16.27882 - 19.0037)^2 + (17.5697 - 19.0037)^2 + (22.00191 - 19.0037)^2 + (18.66676 - 19.0037)^2 = 43.83961. Since 43.83961 is smaller than the previous sum of squared deviations, 268.7479, clustering can continue.
The data set D is again sorted in ascending order of distance-value sum, as shown in the following table (Table 5).
Table 5:
4 16 3 12
2 2 2 4
3 2 3 4
4 15 5 13
8 11 8 9
10 9 7 8
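The accumulation and sorting of this second round can be reproduced with the short sketch below. It assumes Euclidean distance, and it assumes the original order of D is the order implied by the terms of the first-round expression. Both are inferences, since Table 1 itself is not reproduced in this text. Python's `sorted` is stable, so data items with equal distance-value sums keep their original relative order.

```python
import math

# Data set D in the order implied by the first-round arithmetic (an inference)
data = [(10, 9, 7, 8), (4, 16, 3, 12), (2, 2, 2, 4),
        (3, 2, 3, 4), (8, 11, 8, 9), (4, 15, 5, 13)]

first = [math.dist(data[2], item) for item in data]    # round 1, x = (2, 2, 2, 4)
second = [math.dist(data[1], item) for item in data]   # round 2, x = (4, 16, 3, 12)
sums = [a + b for a, b in zip(first, second)]           # running distance-value sums

# Ascending stable sort by distance-value sum; the last row is the next x
ranked = sorted(zip(sums, data), key=lambda pair: pair[0])
for total, item in ranked:
    print(item, round(total, 5))
```

The last row printed is the item with the largest distance-value sum, which becomes the x of the following round.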
Step 3: The last data item in the sorting result of Step 2, x=(10,9,7,8), is selected, and the distances between x and the other data items in D are calculated, as shown in the following table (Table 6).
Table 6:
The calculated distance values are then summed separately for each data item, as shown in the following table (Table 7).
Table 7:
The sum of squared differences between each distance-value sum and their mean is:
(23.22632 - 26.9771)^2 + (27.09547 - 26.9771)^2 + (28.68849 - 26.9771)^2 + (28.97145 - 26.9771)^2 + (25.16419 - 26.9771)^2 + (28.71664 - 26.9771)^2 = 27.30129. Since 27.30129 is smaller than the previous sum of squared deviations, 43.83961, clustering can continue.
The data set D is sorted in ascending order of distance-value sum, as shown in the following table (Table 8).
Table 8:
10 9 7 8
8 11 8 9
4 16 3 12
2 2 2 4
4 15 5 13
3 2 3 4
Step 4: The last data item in the sorting result of Step 3, x=(3,2,3,4), is selected, and the distances between x and the other data items in D are calculated, as shown in the following table (Table 8).
Table 8:
The calculated distance values are then summed separately for each data item, as shown in the following table (Table 9).
Table 9:
The sum of squared differences between each distance-value sum and their mean is:
(34.62807 - 36.54879)^2 + (43.25096 - 36.54879)^2 + (30.102704 - 36.54879)^2 + (28.97145 - 36.54879)^2 + (37.654192 - 36.54879)^2 + (44.68536 - 36.54879)^2 = 215.002. Since 215.002 is greater than the previous sum of squared deviations, 27.30129, clustering stops.
Step 5: For the intermediate results obtained in Steps 1 to 3 (Table 1, Table 3 and Table 6), data items are eliminated in descending order of distance value. The eliminated data items are marked in bold italics in the following tables:
Table 1:
Table 3:
Table 6:
After the 11th data item has been eliminated, no duplicate data items remain across the three tables, and elimination stops.
The data items finally retained in each table form the clustering result, shown in the following tables (Table 9, Table 10 and Table 11).
Table 9:
Table 10:
Table 11:
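The elimination step can be sketched as follows. This is one reading of the rule, not the patent's own procedure. It assumes Euclidean distance, assumes the order of D implied by the first-round arithmetic, and takes "distance value" to mean each item's distance to the x of the round whose table it appears in. All names are illustrative.

```python
import math

# Data set D in the order implied by the first-round arithmetic (an inference)
data = [(10, 9, 7, 8), (4, 16, 3, 12), (2, 2, 2, 4),
        (3, 2, 3, 4), (8, 11, 8, 9), (4, 15, 5, 13)]
centres = [2, 1, 0]   # indices of the x chosen in rounds 1 to 3 of the example

# One entry per (round, item): the item's distance to that round's x
entries = [(math.dist(data[c], item), r, i)
           for r, c in enumerate(centres)
           for i, item in enumerate(data)]

copies = {i: len(centres) for i in range(len(data))}   # sequences still holding item i
kept = {r: set(range(len(data))) for r in range(len(centres))}

# Walk the entries from the largest distance down, deleting duplicates only
for dist, r, i in sorted(entries, key=lambda e: e[0], reverse=True):
    if copies[i] > 1:            # an item left in only one sequence is skipped
        kept[r].discard(i)
        copies[i] -= 1

clusters = [[data[i] for i in sorted(kept[r])] for r in sorted(kept)]
print(clusters)
```

Under these assumptions each data item ends up in exactly one sequence, and each of the three sequences retains two items. Whether this matches the retained items in the patent's final tables cannot be confirmed from this text, because those tables are not reproduced.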
By iteratively analysing the distance-value sums, the distance-based parameter-free multidimensional data clustering method of the present invention avoids the problem, faced by conventional parameterized multidimensional data clustering methods, of choosing an appropriate threshold H, and so makes multidimensional data clustering less difficult.
The above is only a preferred embodiment of the present invention. All equivalent changes or modifications made according to the features and principles described in the scope of this patent application are included within the scope of this patent application.

Claims (1)

  1. A distance-based parameter-free multidimensional data clustering method, characterized in that it uses the following steps:
    Step 1: Randomly select a data item x from the multidimensional data set D;
    Step 2: Calculate the distance between the data item x selected in Step 1 and each of the other data items in D;
    Step 3: Sum, separately for each data item, the distance values calculated in all executions of Step 2; then calculate the sum of squared differences between each distance-value sum and their mean;
    Step 4: If the sum of squares from Step 3 is smaller than the previously calculated sum of squares, sort the data set D in ascending order of distance-value sum and record the sorting result; then select the data item with the largest distance-value sum as the x for the next round of clustering, and re-execute Steps 2 to 4;
    Step 5: If the sum of squares is not smaller than the previously calculated sum of squares, stop sorting;
    Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.
CN201710884448.8A 2017-09-26 2017-09-26 A distance-based parameter-free multidimensional data clustering method Pending CN107562948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710884448.8A CN107562948A (en) 2017-09-26 2017-09-26 A distance-based parameter-free multidimensional data clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710884448.8A CN107562948A (en) 2017-09-26 2017-09-26 A distance-based parameter-free multidimensional data clustering method

Publications (1)

Publication Number Publication Date
CN107562948A (en) 2018-01-09

Family

ID=60982853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710884448.8A Pending CN107562948A (en) 2017-09-26 2017-09-26 A distance-based parameter-free multidimensional data clustering method

Country Status (1)

Country Link
CN (1) CN107562948A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019136929A1 (en) * 2018-01-13 2019-07-18 惠州学院 Data clustering method and device based on k neighborhood similarity as well as storage medium
CN109344194A (en) * 2018-09-20 2019-02-15 北京工商大学 Pesticide residue high dimensional data visual analysis method and system based on subspace clustering
CN109344194B (en) * 2018-09-20 2021-09-28 北京工商大学 Subspace clustering-based pesticide residue high-dimensional data visual analysis method and system
CN110909067A (en) * 2019-10-28 2020-03-24 中南大学 Visual analysis system and method for ocean multidimensional data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180109