CN107506778A

CN107506778A - A clustering method for massive data based on a minimum spanning tree

Info

Publication number: CN107506778A
Application number: CN201710467400.7A
Authority: CN
Inventors: 程林; 贺海磊; 刘满君; 周勤勇; 张彦涛; 梁才浩; 刘琛; 江轶
Original assignee: Tsinghua University; State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Tsinghua University; State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2017-06-20
Filing date: 2017-06-20
Publication date: 2017-12-22

Abstract

The invention relates to a clustering method for massive data based on a minimum spanning tree, and belongs to the technical field of classification and data mining algorithms. The method builds a fully connected tree structure over the massive data using Prim's minimum spanning tree algorithm; selects a suitable distance metric according to the physical meaning of the data and determines the edge weights of the minimum spanning tree from that metric; generates the node incidence matrix corresponding to the structure of the minimum spanning tree and deletes the redundant data in it by exploiting its symmetry; uses the incidence matrix to compute the weight difference degree of high-degree nodes and sorts the nodes by that value; and removes the longer edges at nodes with a high weight difference degree, taking the actual physical meaning of the data into account, to obtain the desired number of sample point clusters. The invention can cluster massive data and reduces the difficulty of subsequent data analysis.

Description

A clustering method for massive data based on a minimum spanning tree

Technical field

The invention relates to a clustering method for massive data based on a minimum spanning tree, and belongs to the technical field of classification and data mining algorithms.

Background art

With the progress of computer science, more and more data analysis tasks involve sample sets so large that the sample points are difficult to describe with a single unified distribution, so the data must first be clustered in a preprocessing stage. Clustering groups related data objects together into several sets with strong internal correlation, so that the objects within each set are more closely related to one another.

Commonly used clustering methods include k-means clustering, hierarchical clustering and fuzzy clustering. Most of these methods depend on the choice of the initial state and decide cluster membership purely from the distance measure between sample points, so they perform poorly on samples that carry a specific physical meaning.

However, in machine learning and related applications, models are often trained on large-scale data that carries physical meaning. If the data is not clustered first, the training process places high demands on both hardware memory and computing speed. In addition, the results of common clustering methods are hard to reconcile with the physical meaning of the data and are therefore often unsatisfactory, which introduces large errors into subsequent data analysis and model training and can seriously set back the related research. To address this problem, traditional clustering methods need to be improved so that sample data with strong physical meaning is processed in a new way that yields better results.

Adding manual decision support to the clustering algorithm is one feasible way to avoid the above errors. Conventional clustering methods follow a single fixed procedure with long and complex calculations, which makes it difficult to incorporate human judgment. This patent therefore uses the minimum spanning tree algorithm to design an improved sample point clustering method. The minimum spanning tree algorithm is one of the common algorithms in planning applications: in engineering, computing the minimum spanning tree of a set of nodes yields a design that is optimal in construction cost or in other performance measures, and it also produces a tree structure over the data points. Because this structure is clear and concise, it is easy for decision makers to analyze, which makes it well suited to improving the performance of clustering methods.

In electric power research, weather condition information is commonly used to forecast the output of various distributed power sources. However, because there are many types of weather conditions and the data is voluminous and complex, the information cannot be applied directly in practical calculations.

Summary of the invention

The purpose of the present invention is to propose a clustering method for massive data based on a minimum spanning tree, which clusters massive data by means of Prim's algorithm and manual decision support, thereby supporting subsequent data analysis work.

The clustering method for massive data based on a minimum spanning tree proposed by the present invention comprises the following steps:

(1) Convert the massive data U to be processed into a node matrix A;

Let dist(i, j) denote the distance between any two data items i and j in the massive data U to be processed, and assign this distance to the entries of matrix A. Node matrix A corresponds to a fully connected graph in which the weight of the edge between any two data items is the distance dist(i, j). If the number of data items to be processed is m, node matrix A is as follows:

$$A = \begin{bmatrix} dist(1,1) & dist(1,2) & \cdots & dist(1,m) \\ dist(2,1) & dist(2,2) & \cdots & dist(2,m) \\ \vdots & \vdots & \ddots & \vdots \\ dist(m,1) & dist(m,2) & \cdots & dist(m,m) \end{bmatrix}$$

(2) Process node matrix A with Prim's method to obtain a sparse minimum-edge-weight node matrix A_m:

A_m=L_m+U_m

The node sparse matrix A_m corresponds to a minimum spanning tree, where L_m is the lower half of A_m and U_m is the upper half of A_m;

(3) For the matrix L_m of step (2), count in its i-th row and i-th column the number D(U(i)) of edges connected to node i in the minimum spanning tree, and record this number as the degree of the corresponding node in node matrix A;

(4) Based on the degree D(U(i)), compute the weight difference measure θ of the edges connected to each node whose degree D(U(i)) is greater than 2, using the following formula:

$$\theta(U(i)) = \max\left(\frac{dist(i,j)^{2}}{dist(i,k)^{2}}\right), \quad 1 \le j, k \le D(U(i))$$

where j and k are nodes connected to node i in the minimum spanning tree of step (2);

(5) Set the number of clusters n for the clustering of the massive data. Sort the nodes by the weight difference measure θ from largest to smallest to obtain a node sequence. For each of the first n-1 nodes in the sequence, delete its largest-weight edge from the minimum spanning tree of step (2). This yields n mutually disconnected trees; the nodes in each tree form one data cluster, giving n data clusters and completing the minimum-spanning-tree-based clustering of the massive data.
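Steps (1) to (5) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, Euclidean distance stands in for the application-specific metric, all points are assumed distinct (no zero-length edges), the degree condition is read as "at least 3" (`min_degree=3`), an edge already chosen by an earlier node is skipped so that distinct edges are removed, and the manual decision step is not modeled.

```python
import math


def distance_matrix(points):
    """Step (1): node matrix A with A[i][j] = dist(i, j) (Euclidean, an assumption)."""
    return [[math.dist(p, q) for q in points] for p in points]


def prim_mst(A):
    """Step (2): Prim's algorithm on the complete graph; returns the symmetric A_m."""
    m = len(A)
    A_m = [[0.0] * m for _ in range(m)]
    in_tree = [False] * m
    in_tree[0] = True
    best = list(A[0])      # cheapest known edge weight from the tree to each node
    parent = [0] * m       # tree-side endpoint of that cheapest edge
    for _ in range(m - 1):
        j = min((v for v in range(m) if not in_tree[v]), key=lambda v: best[v])
        A_m[j][parent[j]] = A_m[parent[j]][j] = A[j][parent[j]]
        in_tree[j] = True
        for v in range(m):
            if not in_tree[v] and A[j][v] < best[v]:
                best[v], parent[v] = A[j][v], j
    return A_m


def mst_clusters(points, n, min_degree=3):
    """Steps (3)-(5): cut the minimum spanning tree into at most n clusters."""
    A_m = prim_mst(distance_matrix(points))
    m = len(points)
    theta = {}
    for i in range(m):
        w = [x for x in A_m[i] if x > 0]           # weights of the edges at node i
        if len(w) >= min_degree:                   # step (3): degree test
            theta[i] = (max(w) / min(w)) ** 2      # step (4): largest squared ratio
    removed = set()
    for i in sorted(theta, key=theta.get, reverse=True):   # step (5): largest theta first
        if len(removed) == n - 1:
            break
        j = max(range(m), key=lambda v: A_m[i][v])          # heaviest edge at node i
        removed.add(frozenset((i, j)))
    # Label the connected components that remain after deleting the chosen edges.
    labels = [-1] * m
    cluster = 0
    for s in range(m):
        if labels[s] != -1:
            continue
        labels[s] = cluster
        stack = [s]
        while stack:
            u = stack.pop()
            for v in range(m):
                if A_m[u][v] > 0 and labels[v] == -1 and frozenset((u, v)) not in removed:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    return labels


# Hypothetical data: two star-shaped groups of four points, far apart.
points = [(0, 0), (1, 0), (0, 1), (0, -1), (10, 0), (11, 0), (10, 1), (10, -1)]
labels = mst_clusters(points, n=2)
print(labels)
```

In this toy data set the only long edge of the tree is the bridge between the two groups, and it is attached to a node of degree 4, so that node has the largest θ and the bridge is the edge that gets cut. If too few nodes pass the degree test, the sketch returns fewer than n clusters rather than cutting arbitrary edges.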

The features of the clustering method for massive data based on a minimum spanning tree proposed by the present invention are as follows:

By using Prim's algorithm to compute the minimum spanning tree, the invention provides a clustering method for massive data that is suitable for multidimensional sample data with strong physical meaning. Because a minimum spanning tree algorithm is used, the method can take the distance metric of the clustering technique as the weight of the tree branches during the calculation, and thereby generate the minimum spanning tree in which the system as a whole is most closely connected. On this basis, the clear and concise tree structure of the sample points makes it easy for decision makers to apply appropriate manual corrections, finally yielding the desired sample clusters.

The present invention has the following advantages:

1. The invention uses Prim's algorithm to build the minimum spanning tree. Compared with traditional clustering methods, the clustering result of this method does not depend on the choice of initial points, and the data points within each sample cluster are more closely related, so the method gives a more stable and better clustering result in practical use.

2. The invention adds manual decision support on top of the clustering method. When sample points with strong physical meaning are clustered, the result can be corrected by the decision maker, which reduces the losses that clustering errors cause in subsequent data analysis and improves the accuracy of model training.

Brief description of the drawings

Fig. 1(a) is a schematic diagram of the cluster distribution obtained by directly removing the longest edge of the minimum spanning tree.

Fig. 1(b) is a schematic diagram of the cluster distribution obtained by the method of the invention.

Fig. 1(c) is a schematic diagram of the cluster distribution obtained by a conventional clustering method.

Fig. 2 compares the clustering performance of the method of the invention with that of common methods.

Embodiment

The clustering method for massive data based on a minimum spanning tree proposed by the present invention can be used to cluster multidimensional data sample points with strong physical meaning. The method comprises the following steps:

In this method, the minimum spanning tree of the weather condition sample points is built according to Prim's algorithm and its associated calculation rules, and multiple sample clusters are then generated by computing the node weight difference degree.

(1) Convert the massive data U to be processed into a node matrix A;

A_m=L_m+U_m

The node sparse matrix A_m corresponds to a minimum spanning tree, where L_m is the lower half of A_m and U_m is the upper half of A_m. Because the minimum-edge-weight node sparse matrix A_m is symmetric, only its lower half L_m is analyzed;

(5) Set the number of clusters n for the clustering of the massive data. Sort the nodes by the weight difference measure θ from largest to smallest to obtain a node sequence. For each of the first n-1 nodes in the sequence, delete its largest-weight edge from the minimum spanning tree of step (2). This yields n mutually disconnected trees; the nodes in each tree form one data cluster, giving n data clusters and completing the minimum-spanning-tree-based clustering of the massive data. Fig. 1(b) shows the sample point cluster distribution obtained by the method of the invention. For comparison, Fig. 1(a) shows the cluster distribution obtained by removing the longest edge after generating the minimum spanning tree of the sample points, and Fig. 1(c) shows the cluster distribution obtained by the k-means method.
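The symmetric handling of A_m can be illustrated on a small hand-built tree. The sketch below keeps only the lower half L_m and recovers each node's degree and weight difference measure by scanning row i and column i of L_m. The helper names and the four-node tree are hypothetical and serve only as an illustration.

```python
def lower_half(A_m):
    """Keep only the entries below the diagonal of the symmetric matrix A_m (L_m)."""
    size = len(A_m)
    return [[A_m[i][j] if j < i else 0.0 for j in range(size)] for i in range(size)]


def edge_weights_at(L_m, i):
    """Collect the weights of the tree edges at node i from row i and column i of L_m."""
    in_row = [L_m[i][j] for j in range(i) if L_m[i][j] > 0]
    in_col = [L_m[r][i] for r in range(i + 1, len(L_m)) if L_m[r][i] > 0]
    return in_row + in_col


def degree(L_m, i):
    """Degree of node i: the number of tree edges found in row i and column i."""
    return len(edge_weights_at(L_m, i))


def theta(L_m, i):
    """Weight difference measure: largest squared ratio between two edges at node i."""
    w = edge_weights_at(L_m, i)
    return (max(w) / min(w)) ** 2


# Hypothetical tree: node 1 is joined to nodes 0, 2 and 3 with weights 1, 2 and 4.
A_m = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 2.0, 4.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
]
L_m = lower_half(A_m)
U_m = [list(column) for column in zip(*L_m)]   # upper half is the transpose of L_m
degrees = [degree(L_m, i) for i in range(len(L_m))]
print(degrees, theta(L_m, 1))
```

Storing only L_m halves the number of stored edge entries, at the cost of having to read both a row and a column to see all edges of a node.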

The clustering performance comparison is shown in Fig. 2. As can be seen, the method of the invention performs better on the Davies-Bouldin index (DBI), the Dunn index (DI) and the weight index (WI), and can therefore be widely applied in practical data processing.
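For readers unfamiliar with the first two indices, the sketch below computes them from their standard textbook definitions. It assumes Euclidean distance and centroid-based scatter for the Davies-Bouldin index, and minimum between-cluster distance over maximum within-cluster diameter for the Dunn index. The text does not say which variants were used for Fig. 2, and the weight index (WI) is not defined in it, so WI is omitted. The function names and the sample clusters are hypothetical.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def davies_bouldin(clusters):
    """Mean over clusters of the worst (S_i + S_j) / M_ij ratio; lower is better."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, z) for p in c) / len(c) for c, z in zip(clusters, cents)]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


def dunn(clusters):
    """Smallest between-cluster distance over largest cluster diameter; higher is better.

    Needs at least one cluster with two or more points so that a diameter exists.
    """
    separation = min(
        math.dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b
    )
    diameter = max(math.dist(p, q) for c in clusters for p, q in combinations(c, 2))
    return separation / diameter


# Hypothetical clusters: two tight pairs of points, far apart.
clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
dbi = davies_bouldin(clusters)
di = dunn(clusters)
print(dbi, di)
```

A compact, well-separated clustering gives a small Davies-Bouldin value and a large Dunn value, which is the direction in which a comparison between methods should be read.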

Claims

1. A clustering method for massive data based on a minimum spanning tree, characterized in that the method comprises the following steps:

(1) converting the massive data U to be processed into a node matrix A;

letting dist(i, j) denote the distance between any two data items i and j in the massive data U to be processed, and assigning this distance to the entries of matrix A, wherein node matrix A corresponds to a fully connected graph in which the weight of the edge between any two data items is the distance dist(i, j), and, if the number of data items to be processed is m, node matrix A is as follows:

$$A = \begin{bmatrix} dist(1,1) & dist(1,2) & \cdots & dist(1,m) \\ dist(2,1) & dist(2,2) & \cdots & dist(2,m) \\ \vdots & \vdots & \ddots & \vdots \\ dist(m,1) & dist(m,2) & \cdots & dist(m,m) \end{bmatrix}$$

(2) processing node matrix A with Prim's method to obtain a sparse minimum-edge-weight node matrix A_m:

A_m=L_m+U_m

wherein the node sparse matrix A_m corresponds to a minimum spanning tree, L_m is the lower half of A_m, and U_m is the upper half of A_m;

(3) for the matrix L_m of step (2), counting in its i-th row and i-th column the number D(U(i)) of edges connected to node i in the minimum spanning tree, and recording this number as the degree of the corresponding node in node matrix A;

(4) based on the degree D(U(i)), computing the weight difference measure θ of the edges connected to each node whose degree D(U(i)) is greater than 2, using the following formula:

$$\theta(U(i)) = \max\left(\frac{dist(i,j)^{2}}{dist(i,k)^{2}}\right), \quad 1 \le j, k \le D(U(i))$$

wherein j and k are nodes connected to node i in the minimum spanning tree of step (2);

(5) setting the number of clusters n for the clustering of the massive data, sorting the nodes by the weight difference measure θ from largest to smallest to obtain a node sequence, and, for each of the first n-1 nodes in the sequence, deleting its largest-weight edge from the minimum spanning tree of step (2), thereby obtaining n mutually disconnected trees, wherein the nodes in each tree form one data cluster, so that n data clusters are obtained and the minimum-spanning-tree-based clustering of the massive data is complete.