CN107562948A - A distance-based parameter-free multidimensional data clustering method

Info

Publication number: CN107562948A
Application number: CN201710884448.8A
Authority: CN
Inventors: 莫毓昌
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-09-26
Filing date: 2017-09-26
Publication date: 2018-01-09

Abstract

The present invention relates to the technical field of data analysis, and in particular to a distance-based parameter-free multidimensional data clustering method comprising the following steps. Step 1: randomly select a data item x from the multidimensional data set D. Step 2: calculate the distance between the data item x selected in Step 1 and each of the other data items in D. Step 3: for each data item, sum all the distance values calculated for it in Step 2; then calculate the sum of squared differences between each distance sum and the mean of the distance sums. By iteratively analyzing the distance sums, the method overcomes the problem of how to choose an appropriate threshold H, which conventional parameterized multidimensional data clustering methods face, and thereby simplifies multidimensional data clustering.

Description

A distance-based parameter-free multidimensional data clustering method

【Technical field】

The present invention relates to the technical field of data analysis, and in particular to a distance-based parameter-free multidimensional data clustering method.

【Background art】

Clustering is a very important technique for analyzing multidimensional data. Cluster analysis is the process of grouping a set of physical or abstract objects into multiple classes, each made up of similar objects. It is an important human activity.

The goal of cluster analysis is to classify data on the basis of similarity. Clustering draws on many fields, including mathematics, computer science, statistics, biology, and economics. Many clustering techniques have been developed in different application areas; they are used to describe data, to measure the similarity between different data sources, and to assign data sources to different clusters.

Clustering places data items with similar features in a multidimensional data set into the same class. The feature difference between data items is usually characterized by the distance $D_{x,y}$ between multidimensional data items x and y:

$$D_{x,y} = \sqrt{\sum_{i=1}^{L} (x_i - y_i)^2}$$

where L is the number of dimensions of the multidimensional data, and $x_i$ and $y_i$ are the values of data items x and y in the i-th dimension.
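A minimal Python sketch of this distance, assuming it is the Euclidean distance. That assumption agrees with the values in the worked example of the embodiment, for instance 1.414214 between (2, 2, 2, 4) and (3, 2, 3, 4). The function name `euclidean_distance` is illustrative and does not come from the patent.

```python
import math


def euclidean_distance(x, y):
    """Distance D_{x,y} between two data items with the same number of dimensions L."""
    if len(x) != len(y):
        raise ValueError("data items must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two items taken from the worked example in the embodiment.
print(euclidean_distance((2, 2, 2, 4), (3, 2, 3, 4)))
```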

Conventional distance-based multidimensional data clustering methods require a threshold parameter H to be set, and stipulate that the distance between data items in the same class must not exceed H. For such parameterized methods, the question of how to choose an appropriate threshold H makes multidimensional data clustering more difficult.

【Summary of the invention】

In view of the defects and deficiencies of the prior art, the present invention aims to provide a distance-based parameter-free multidimensional data clustering method. By iteratively analyzing the sums of distance values, it overcomes the problem of how to choose an appropriate threshold H, which conventional parameterized multidimensional data clustering methods face, and thereby simplifies multidimensional data clustering.

The distance-based parameter-free multidimensional data clustering method of the present invention uses the following method steps:

Step 1: Randomly select a data item x from the multidimensional data set D;

Step 2: Calculate the distance between the data item x selected in Step 1 and each of the other data items in D;

Step 3: For each data item, sum all the distance values calculated for it in Step 2; then calculate the sum of squared differences between each distance sum and the mean of the distance sums;

Step 4: If the sum of squares obtained in Step 3 is smaller than the one obtained in the previous iteration, sort the data set D in ascending order of distance sum and record the sorting result; then select the data item with the largest distance sum as x for the next round of clustering, and re-execute Steps 2 to 4;

Step 5: If the sum of squares is not smaller than the one obtained in the previous iteration, stop sorting;

Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.
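Steps 1 to 5 can be sketched in Python as follows. This is one illustrative reading of the steps, not a reference implementation from the patent: the name `run_rounds`, the `start_index` and `seed` parameters, the layout of the returned dictionaries, and the use of Euclidean distance through `math.dist` (Python 3.8+) are all assumptions. Step 6 is not covered. The sample data are the six items of the embodiment, listed in an order inferred from its worked example.

```python
import math
import random


def run_rounds(data, start_index=None, seed=None):
    """Repeat Steps 2-4 until the sum of squared deviations stops decreasing.

    Returns one record per accepted round: the index of the chosen item x,
    the distances from x to every item, the ascending order by distance sum,
    and the sum of squared deviations of the distance sums.
    """
    rng = random.Random(seed)
    n = len(data)
    # Step 1: pick the first x at random unless the caller fixes it.
    x_idx = rng.randrange(n) if start_index is None else start_index
    totals = [0.0] * n        # running distance sum per data item
    previous_ssd = math.inf   # guarantees that the first round is accepted
    rounds = []
    while True:
        # Step 2: distance from x to every item (x itself gets 0).
        dists = [math.dist(data[x_idx], item) for item in data]
        # Step 3: update the per-item sums and measure their spread.
        new_totals = [t + d for t, d in zip(totals, dists)]
        mean = sum(new_totals) / n
        ssd = sum((t - mean) ** 2 for t in new_totals)
        # Step 5: stop when the spread is not smaller than in the previous round.
        if ssd >= previous_ssd:
            break
        # Step 4: sort ascending by distance sum and record the result.
        order = sorted(range(n), key=lambda i: new_totals[i])
        rounds.append({"x": x_idx, "distances": dists, "order": order, "ssd": ssd})
        totals, previous_ssd = new_totals, ssd
        x_idx = order[-1]     # the item with the largest sum becomes the next x
    return rounds


# Items of the embodiment; the ordering is an assumption inferred from the text.
data = [(10, 9, 7, 8), (4, 16, 3, 12), (2, 2, 2, 4),
        (3, 2, 3, 4), (8, 11, 8, 9), (4, 15, 5, 13)]
for record in run_rounds(data, start_index=2):
    print(data[record["x"]], record["ssd"])
```

Keeping only the accepted rounds mirrors the text, in which the round that triggers the stop condition contributes no sequence to Step 6.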

With the above method, the beneficial effects of the invention are as follows: by iteratively analyzing the sums of distance values, the distance-based parameter-free multidimensional data clustering method of the invention overcomes the problem of how to choose an appropriate threshold H, which conventional parameterized multidimensional data clustering methods face, and thereby simplifies multidimensional data clustering.

【Detailed description】

The present invention is described in detail below with reference to a specific embodiment. The illustrative example and its explanations serve only to explain the invention and do not limit it.

The distance-based parameter-free multidimensional data clustering method of this embodiment uses the following method steps:

Step 1: Randomly select a data item x from the multidimensional data set D;

The invention is illustrated below with a concrete example:

Step 1: Randomly select the 3rd data item x = (2, 2, 2, 4) from the multidimensional data set D (Table one), and calculate the distance between x and the other data items in D;

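Table one：

Dimension 1	Dimension 2	Dimension 3	Dimension 4	Distance to x
10	9	7	8	12.40967
4	16	3	12	16.27882
2	2	2	4	0
3	2	3	4	1.414214
8	11	8	9	13.34166
4	15	5	13	16.21727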

The calculated distance values are then summed per data item.

Because this is the first distance calculation, each distance sum is simply the distance itself. Calculate the sum of squared differences between each distance sum and the mean of the distance sums:

(12.40967-9.9436)²+(16.27882-9.9436)²+(0-9.9436)²+(1.414214-9.9436)²+(13.34166-9.9436)²+(16.21727-9.9436)²=268.7479.
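The arithmetic above can be checked with a short Python sketch. The helper name `sum_squared_deviations` is illustrative; the inputs are the six first-round distance sums quoted in the expression above.

```python
def sum_squared_deviations(values):
    """Sum of squared differences between each value and the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# First-round distance sums as quoted in the text (rounded values).
first_round = [12.40967, 16.27882, 0, 1.414214, 13.34166, 16.21727]
print(sum(first_round) / len(first_round))   # mean of the distance sums
print(sum_squared_deviations(first_round))   # value compared between rounds
```

Because the inputs are rounded, the printed values agree with the figures in the text only to about four decimal places.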

Next, the data set D is sorted in ascending order of distance sum, giving the following table (Table two):

Table two：

2	2	2	4
3	2	3	4
10	9	7	8
8	11	8	9
4	15	5	13
4	16	3	12

Step 2: Select the last data item x = (4, 16, 3, 12) from the sorting result of Step 1, and calculate the distance between x and the other data items in D, as shown in the following table (Table three):

Table three：

The calculated distance values are then summed per data item, as shown in the following table (Table four):

Table four：

Calculate the sum of squared differences between each distance sum and the mean of the distance sums:

(23.22632-19.0037)²+(16.27882-19.0037)²+(16.27882-19.0037)²+(17.5697-19.0037)²+(22.00191-19.0037)²+(18.66676-19.0037)²=43.83961. Because 43.83961 is smaller than the previous sum of squared deviations, 268.7479, clustering can continue.

The data set D is then sorted in ascending order of distance sum, as shown in the following table (Table five):

Table five：

4	16	3	12
2	2	2	4
3	2	3	4
4	15	5	13
8	11	8	9
10	9	7	8
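The ordering in Table five contains a tie: (4, 16, 3, 12) and (2, 2, 2, 4) both have a distance sum of 16.27882, and the text does not say how ties are broken. The Python sketch below assumes that tied items keep their original relative order in D, which is what a stable sort produces. The item order used for D is itself an assumption, inferred from the order of the terms in the first-round expression; the sums are the second-round values quoted in the text.

```python
# Items of D in the assumed original order, paired with the second-round
# distance sums quoted in the text.
items = [(10, 9, 7, 8), (4, 16, 3, 12), (2, 2, 2, 4),
         (3, 2, 3, 4), (8, 11, 8, 9), (4, 15, 5, 13)]
sums = [23.22632, 16.27882, 16.27882, 17.5697, 22.00191, 18.66676]

# sorted() is stable: items with equal sums keep their original relative order.
# The key looks only at the sum, so tied items are never compared directly.
ranked = [item for _, item in sorted(zip(sums, items), key=lambda pair: pair[0])]

# The item with the largest distance sum becomes x for the next round.
next_x = ranked[-1]
print(ranked)
print(next_x)
```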

Step 3: Select the last data item x = (10, 9, 7, 8) from the sorting result of Step 2, and calculate the distance between x and the other data items in D, as shown in the following table (Table six):

Table six：

The calculated distance values are then summed per data item, as shown in the following table (Table seven):

Table seven：

Calculate the sum of squared differences between each distance sum and the mean of the distance sums:

(23.22632-26.9771)²+(27.09547-26.9771)²+(28.68849-26.9771)²+(28.97145-26.9771)²+(25.16419-26.9771)²+(28.71664-26.9771)²=27.30129. Because 27.30129 is smaller than the previous sum of squared deviations, 43.83961, clustering can continue.

The data set D is sorted in ascending order of distance sum, as shown in the following table (Table eight):

Table eight：

10	9	7	8
8	11	8	9
4	16	3	12
2	2	2	4
4	15	5	13
3	2	3	4

Step 4: Select the last data item x = (3, 2, 3, 4) from the sorting result of Step 3, and calculate the distance between x and the other data items in D, as shown in the following table (Table eight):

Table eight：

The calculated distance values are then summed per data item, as shown in the following table (Table nine):

Table nine：

Calculate the sum of squared differences between each distance sum and the mean of the distance sums:

(34.62807-36.5488)²+(43.25096-36.5488)²+(30.102704-36.5488)²+(28.97145-36.5488)²+(37.654192-36.5488)²+(44.68536-36.5488)²=215.0020. Because 215.0020 is greater than the previous sum of squared deviations, 27.30129, clustering stops.

Step 5: For the intermediate results obtained in Steps 1 to 3 (Table one, Table three, and Table six), eliminate data items in descending order of distance value. The eliminated data items are marked in bold italics in the following tables:

Table one,

Table three,

Table six

After the 11th data item has been eliminated, no duplicated data items remain in the three tables, and elimination stops.
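The elimination step can be sketched in Python as follows. This is one possible reading of the step, under stated assumptions: distances are Euclidean, each table holds the distance from its item x to all six items of D (including x itself, at distance 0), and entries are removed across all tables in descending order of distance. The names `eliminate` and `seeds`, and the item order used for D, are hypothetical.

```python
import math


def eliminate(data, seeds):
    """Delete table entries, largest distance first, until no item is repeated.

    `seeds` holds the index of the item x used in each recorded round; one
    distance table is built per seed. Returns the items left in each table.
    """
    tables = [{i: math.dist(data[s], item) for i, item in enumerate(data)}
              for s in seeds]
    # Every (distance, table, item) entry, ordered from largest distance down.
    entries = sorted(((d, t, i)
                      for t, table in enumerate(tables)
                      for i, d in table.items()),
                     key=lambda entry: entry[0], reverse=True)
    remaining = {i: len(tables) for i in range(len(data))}  # tables holding item i
    for _, t, i in entries:
        if all(count == 1 for count in remaining.values()):
            break        # no item is repeated any more
        if remaining[i] == 1:
            continue     # the item exists in only one table: skip its deletion
        del tables[t][i]
        remaining[i] -= 1
    return [[data[i] for i in sorted(table)] for table in tables]


# Items of D in an assumed order; the seeds are the x of the three recorded
# rounds: (2, 2, 2, 4), (4, 16, 3, 12) and (10, 9, 7, 8).
data = [(10, 9, 7, 8), (4, 16, 3, 12), (2, 2, 2, 4),
        (3, 2, 3, 4), (8, 11, 8, 9), (4, 15, 5, 13)]
for group in eliminate(data, seeds=[2, 1, 0]):
    print(group)
```

The skip rule ensures that every data item survives in exactly one table, so each item ends up in exactly one cluster.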

The data items that remain in each table after elimination form the clustering result, shown in the following tables (Table nine, Table ten, and Table eleven):

Table nine：

Table ten：

Table eleven：

By iteratively analyzing the sums of distance values, the distance-based parameter-free multidimensional data clustering method of the present invention overcomes the problem of how to choose an appropriate threshold H, which conventional parameterized multidimensional data clustering methods face, and thereby simplifies multidimensional data clustering.

The above is merely a preferred embodiment of the present invention. All equivalent changes or modifications made according to the features and principles described within the scope of this patent application are included within the scope of this patent application.

Claims

1. A distance-based parameter-free multidimensional data clustering method, characterized in that it uses the following method steps:

Step 1: Randomly select a data item x from the multidimensional data set D;

Step 2: Calculate the distance between the data item x selected in Step 1 and each of the other data items in D;

Step 3: For each data item, sum all the distance values calculated for it in Step 2; then calculate the sum of squared differences between each distance sum and the mean of the distance sums;

Step 4: If the sum of squares obtained in Step 3 is smaller than the one obtained in the previous iteration, sort the data set D in ascending order of distance sum and record the sorting result; then select the data item with the largest distance sum as x for the next round of clustering, and re-execute Steps 2 to 4;

Step 5: If the sum of squares is not smaller than the one obtained in the previous iteration, stop sorting;

Step 6: For the multiple sequences obtained, delete data items in descending order of distance value until no data item is repeated across the sequences. During deletion, if a data item exists in only one sequence, skip the deletion of that data item.