CN107506778A - A kind of mass data clustering processing method based on minimum spanning tree - Google Patents

Mass data clustering processing method based on minimum spanning tree

Info

Publication number
CN107506778A
Authority
CN
China
Prior art keywords
node
mrow
mass data
spanning tree
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710467400.7A
Other languages
Chinese (zh)
Inventor
程林
贺海磊
刘满君
周勤勇
张彦涛
梁才浩
刘琛
江轶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Tsinghua University
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, and Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority to CN201710467400.7A
Publication of CN107506778A
Current legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a mass data clustering processing method based on a minimum spanning tree, and belongs to the technical field of classification and data mining algorithms. The method builds a fully connected tree structure over the mass data using Prim's minimum spanning tree algorithm; selects a suitable distance measure according to the physical meaning of the data and uses it to determine the edge weights of the minimum spanning tree; generates the corresponding node incidence matrix from the structure of the minimum spanning tree and removes its redundant entries by exploiting its symmetry; computes, from the incidence matrix, the weight-difference measure of the high-degree nodes and sorts the nodes by that measure; and removes the longer edges of the nodes with the larger weight differences, taking the actual physical meaning of the data into account, to obtain the desired number of sample point clusters. The invention can cluster mass data and thereby reduce the difficulty of subsequent data analysis.

Description

Mass data clustering processing method based on minimum spanning tree
Technical field
The invention relates to a mass data clustering processing method based on a minimum spanning tree, and belongs to the technical field of classification and data mining algorithms.
Background
With the progress of computer science, a growing number of data analysis tasks involve sample sets so large that the sample points are difficult to describe with a single, unified distribution, so the data must first be clustered. Clustering groups related data objects together to form several data sets with strong internal correlation, so that the objects within each set are closely related to one another.
Commonly used clustering methods include k-means clustering, hierarchical clustering, and fuzzy clustering. Most of these methods depend on the choice of initial state and assign clusters purely according to a distance measure between sample points, so they perform poorly on samples that carry a specific physical meaning.
In machine learning and related applications, however, models are often trained on large-scale data with physical meaning. If the data are not clustered first, the training process places high demands on hardware memory and computing speed. In addition, the results of common clustering methods are difficult to reconcile with the physical meaning of the data and are therefore often unsatisfactory; they introduce large errors into subsequent data analysis and model training and can cause heavy losses to the related research. To address this problem, traditional clustering methods need to be improved so that sample data with strong physical meaning are processed in a new way that yields better results.
Adding manual decision support to the clustering algorithm is one feasible way to avoid these errors. Conventional clustering methods follow a single fixed procedure with lengthy, complicated computation, which makes it hard to incorporate human judgment. This patent therefore uses a minimum spanning tree algorithm to design an improved sample point clustering method. The minimum spanning tree algorithm is one of the common algorithms in planning applications: computing the minimum spanning tree of a set of nodes yields designs that are optimal in construction cost or other performance measures in engineering practice, and it also produces a tree structure over the data points. Because that structure is simple and clear and easy for decision makers to analyze, it is well suited to improving the performance of clustering methods.
In electric power research, weather condition information is commonly used to forecast the output of various distributed power sources. However, because weather conditions are of many types and the data are numerous and complicated, they cannot be applied directly in practical calculations.
Summary of the invention
The purpose of the invention is to propose a mass data clustering processing method based on a minimum spanning tree, which clusters mass data by means of Prim's algorithm and manual decision support, thereby supporting subsequent data analysis work.
The mass data clustering processing method based on a minimum spanning tree proposed by the invention comprises the following steps:
(1) Convert the mass data U to be processed into a node matrix A.
Let dist(i, j) denote the distance between any two data items i and j in U, and assign these distances as the entries of the matrix A. The node matrix A corresponds to a fully connected graph whose edge weights are the distances dist(i, j) between pairs of data items. If the number of data items to be processed is m, the node matrix A is:
$$A = \begin{bmatrix} \mathrm{dist}(1,1) & \cdots & \mathrm{dist}(1,m) \\ \vdots & \ddots & \vdots \\ \mathrm{dist}(m,1) & \cdots & \mathrm{dist}(m,m) \end{bmatrix}$$
(2) Apply Prim's algorithm to the node matrix A to obtain a sparse node matrix $A_m$ of minimum edge weights:
$$A_m = L_m + U_m$$
The sparse node matrix $A_m$ corresponds to a minimum spanning tree, where $L_m$ is the lower half of $A_m$ and $U_m$ is the upper half of $A_m$.
(3) For each node i of the minimum spanning tree, count in the i-th row and i-th column of the matrix $L_m$ of step (2) the number D(U(i)) of edges connected to node i, and record D(U(i)) as the degree of the corresponding node in the node matrix A.
(4) Using the degrees D(U(i)), compute the weight-difference measure θ of the edges connected to each node whose degree D(U(i)) is greater than 2:
$$\theta(U(i)) = \max\left(\frac{\mathrm{dist}(i,j)^2}{\mathrm{dist}(i,k)^2}\right), \quad 1 \le j, k \le D(U(i))$$
where j and k are nodes connected to node i in the minimum spanning tree of step (2).
(5) Set the number of clusters n for the clustering process. Sort the nodes by the size of the weight-difference measure θ, largest first, to obtain a node sequence. For each of the first n-1 nodes of the sequence, delete the edge of largest weight from the minimum spanning tree of step (2). This yields n mutually disconnected trees; the nodes of each tree form one data cluster, giving n data clusters and completing the mass data clustering processing based on the minimum spanning tree.
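Steps (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the distance measure is assumed to be Euclidean (the method leaves the choice to the physical meaning of the data), and all points are assumed distinct so that a zero entry in $A_m$ always means "no edge".

```python
import math

def distance_matrix(points):
    """Step (1): node matrix A with A[i][j] = dist(i, j).
    Euclidean distance is an assumption made for this sketch."""
    m = len(points)
    return [[math.dist(points[i], points[j]) for j in range(m)] for i in range(m)]

def prim_mst(A):
    """Step (2): Prim's algorithm on the dense matrix A.
    Returns the sparse symmetric matrix A_m holding only the MST edge weights."""
    m = len(A)
    in_tree = [False] * m
    best = [math.inf] * m   # cheapest known edge linking each node to the tree
    parent = [-1] * m
    best[0] = 0.0
    A_m = [[0.0] * m for _ in range(m)]
    for _ in range(m):
        # pick the node outside the tree that is closest to it
        u = min((i for i in range(m) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            A_m[u][parent[u]] = A_m[parent[u]][u] = A[u][parent[u]]
        for v in range(m):
            if not in_tree[v] and A[u][v] < best[v]:
                best[v], parent[v] = A[u][v], u
    return A_m

def lower_triangle(A_m):
    """L_m: entries of A_m below the diagonal (A_m is symmetric)."""
    m = len(A_m)
    return [[A_m[i][j] if j < i else 0.0 for j in range(m)] for i in range(m)]

def degrees(L_m):
    """Step (3): D(U(i)) = nonzero entries in row i plus column i of L_m."""
    m = len(L_m)
    return [sum(1 for j in range(m) if L_m[i][j] > 0)
            + sum(1 for j in range(m) if L_m[j][i] > 0) for i in range(m)]

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
A = distance_matrix(points)
L_m = lower_triangle(prim_mst(A))
print(degrees(L_m))  # [1, 2, 2, 2, 1]
```

The five collinear points produce a chain-shaped tree, so the two end nodes have degree 1 and the inner nodes have degree 2. The dense-matrix form of Prim's algorithm costs O(m²) time and memory, which matches the matrix formulation of the text but may need a sparser representation for very large m.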
The features of the mass data clustering processing method based on a minimum spanning tree proposed by the invention are as follows:
Building on Prim's algorithm for computing a minimum spanning tree, the invention proposes a clustering processing method for mass data that is suited to multidimensional sample data with strong physical meaning. Because it uses a minimum spanning tree algorithm, the method can take the distance measure used in clustering as the weight of the tree branches during computation, producing the minimum spanning tree that links the whole system most tightly. On this basis, the clear and simple tree structure of the sample points makes it easy for decision makers to apply appropriate manual corrections, finally giving a satisfactory classification into sample clusters.
The invention has the following advantages:
1. The invention uses Prim's algorithm to build the minimum spanning tree. Compared with traditional clustering methods, its clustering result does not depend on the choice of initial point, and the data points inside each sample cluster are more closely related, so it gives a more stable and better clustering result in practice.
2. The invention adds manual decision support on top of the clustering method. When sample points with strong physical meaning are clustered, the result can be corrected by the decision maker, which reduces the losses that clustering errors cause in subsequent data analysis and improves the accuracy of model training.
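The independence from the initial point stated in advantage 1 can be checked on a small example. The sketch below is illustrative only: the graph, its weights, and the function name `prim_edges` are made up for the demonstration. It relies on a standard property of minimum spanning trees: when all edge weights are distinct the tree is unique, so Prim's algorithm returns the same edges from any starting node.

```python
import heapq

def prim_edges(weights, start):
    """Heap-based Prim's algorithm on an undirected graph.
    weights: dict {(i, j): w}; returns the MST as a set of frozenset edges."""
    adj = {}
    for (i, j), w in weights.items():
        adj.setdefault(i, []).append((w, j))
        adj.setdefault(j, []).append((w, i))
    visited = {start}
    heap = [(w, start, v) for w, v in adj[start]]
    heapq.heapify(heap)
    tree = set()
    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)   # cheapest edge leaving the tree
        if v in visited:
            continue
        visited.add(v)
        tree.add(frozenset((u, v)))
        for w2, x in adj[v]:
            if x not in visited:
                heapq.heappush(heap, (w2, v, x))
    return tree

# Hypothetical 4-node graph with distinct edge weights.
weights = {(0, 1): 1.0, (0, 2): 4.0, (1, 2): 2.0, (1, 3): 6.0, (2, 3): 3.0}
trees = [prim_edges(weights, s) for s in range(4)]
print(all(t == trees[0] for t in trees))  # True
```

If several edges share the same weight, different starting nodes may return different trees of equal total weight, so the guarantee is on the total weight rather than on the exact edge set in that case.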
Brief description of the drawings
Fig. 1(a) is a schematic diagram of the cluster distribution obtained by directly removing the longest edges of the minimum spanning tree.
Fig. 1(b) is a schematic diagram of the cluster distribution obtained with the method of the invention.
Fig. 1(c) is a schematic diagram of the cluster distribution obtained with a conventional clustering method.
Fig. 2 compares the clustering performance of the method of the invention with that of common methods.
Embodiments
The mass data clustering processing method based on a minimum spanning tree proposed by the invention can be used to cluster multidimensional data sample points with strong physical meaning. The method comprises the following steps:
(1) In this method, the minimum spanning tree of the weather condition sample points is built with Prim's algorithm and its associated calculation rules: the minimum spanning tree of the sample points is established, and multiple sample clusters are generated by computing the node weight differences. Convert the mass data U to be processed into a node matrix A.
Let dist(i, j) denote the distance between any two data items i and j in U, and assign these distances as the entries of the matrix A. The node matrix A corresponds to a fully connected graph whose edge weights are the distances dist(i, j) between pairs of data items. If the number of data items to be processed is m, the node matrix A is:
$$A = \begin{bmatrix} \mathrm{dist}(1,1) & \cdots & \mathrm{dist}(1,m) \\ \vdots & \ddots & \vdots \\ \mathrm{dist}(m,1) & \cdots & \mathrm{dist}(m,m) \end{bmatrix}$$
(2) Apply Prim's algorithm to the node matrix A to obtain a sparse node matrix $A_m$ of minimum edge weights:
$$A_m = L_m + U_m$$
The sparse node matrix $A_m$ corresponds to a minimum spanning tree,
where $L_m$ is the lower half of $A_m$ and $U_m$ is the upper half of $A_m$. Because the sparse matrix $A_m$ is symmetric, only its lower half $L_m$ is analyzed.
(3) For each node i of the minimum spanning tree, count in the i-th row and i-th column of the matrix $L_m$ of step (2) the number D(U(i)) of edges connected to node i, and record D(U(i)) as the degree of the corresponding node in the node matrix A.
(4) Using the degrees D(U(i)), compute the weight-difference measure θ of the edges connected to each node whose degree D(U(i)) is greater than 2:
$$\theta(U(i)) = \max\left(\frac{\mathrm{dist}(i,j)^2}{\mathrm{dist}(i,k)^2}\right), \quad 1 \le j, k \le D(U(i))$$
where j and k are nodes connected to node i in the minimum spanning tree of step (2).
(5) Set the number of clusters n for the clustering process. Sort the nodes by the size of the weight-difference measure θ, largest first, to obtain a node sequence. For each of the first n-1 nodes of the sequence, delete the edge of largest weight from the minimum spanning tree of step (2). This yields n mutually disconnected trees; the nodes of each tree form one data cluster, giving n data clusters and completing the mass data clustering processing based on the minimum spanning tree. Fig. 1(b) shows the cluster distribution of the sample points obtained with the method of the invention, Fig. 1(a) shows the cluster distribution obtained by removing the longest edges after generating the minimum spanning tree of the sample points, and Fig. 1(c) shows the cluster distribution obtained with the k-means method.
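Steps (4) and (5) can be sketched as follows, taking an already computed minimum spanning tree as input. Several details here are assumptions rather than statements of the patent: the function name `cut_mst` and the example tree are hypothetical; "degree greater than 2" is read as degree ≥ 3 and exposed as the parameter `min_degree`; nodes are ranked with the largest θ first; edge weights are assumed strictly positive; and when two ranked nodes share the same heaviest edge, the sketch moves on to the next node so that it still tries to remove n-1 distinct edges. Since the maximum of dist(i, j)²/dist(i, k)² over the neighbours of i is reached with the longest edge over the shortest one, θ is computed as (longest / shortest)².

```python
from collections import defaultdict

def cut_mst(num_nodes, mst_edges, n_clusters, min_degree=3):
    """Rank nodes by theta and delete the heaviest edge of the top-ranked nodes.
    mst_edges: list of (i, j, weight) tuples describing a tree.
    Returns (cluster label per node, theta per qualifying node)."""
    adj = defaultdict(dict)
    for i, j, w in mst_edges:
        adj[i][j] = adj[j][i] = w

    # Step (4): theta(U(i)) = (longest incident edge / shortest incident edge)^2
    theta = {}
    for i, nb in adj.items():
        if len(nb) >= min_degree:
            theta[i] = (max(nb.values()) / min(nb.values())) ** 2

    # Step (5): walk the ranked nodes until n-1 distinct edges are removed.
    removed = set()
    for i in sorted(theta, key=theta.get, reverse=True):
        if len(removed) == n_clusters - 1:
            break
        j = max(adj[i], key=adj[i].get)   # neighbour across the heaviest edge
        removed.add(frozenset((i, j)))

    # Each connected component of the remaining forest is one cluster.
    labels = [-1] * num_nodes
    cluster = 0
    for start in range(num_nodes):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1 and frozenset((u, v)) not in removed:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    return labels, theta

# Hypothetical tree: two tight groups joined by one long edge (0, 3).
edges = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 5.0), (3, 4, 1.0), (3, 5, 1.0), (5, 6, 1.2)]
labels, theta = cut_mst(7, edges, n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1, 1]
```

If the tree has fewer than n-1 qualifying nodes, fewer than n clusters come out; this is one place where the manual decision support described in the text would decide how to proceed.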
The clustering performance comparison is shown in Fig. 2. The method of the invention performs better on the Davies-Bouldin index (DBI), the Dunn index (DI), and the weight index (WI), and can therefore be widely applied in practical data processing.
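For readers who want to reproduce this kind of comparison, the first two indices can be computed as sketched below. The text does not say which variants were used, so the code follows the common textbook definitions with Euclidean distance, and the function names and sample clusters are hypothetical. The weight index (WI) is not defined in the text and is left out. A lower Davies-Bouldin index and a higher Dunn index both indicate compact, well-separated clusters.

```python
import math
from itertools import combinations

def centroid(pts):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def davies_bouldin(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, cents[i]) for p in c) / len(c)
               for i, c in enumerate(clusters)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k

def dunn(clusters):
    """Smallest between-cluster point distance over largest within-cluster diameter.
    Needs at least two clusters, one of which has two or more points."""
    min_between = min(math.dist(p, q)
                      for a, b in combinations(clusters, 2) for p in a for q in b)
    max_within = max(math.dist(p, q)
                     for c in clusters for p, q in combinations(c, 2))
    return min_between / max_within

clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
print(davies_bouldin(clusters), dunn(clusters))
```

In this example each cluster has scatter 1 and the centroids are 10 apart, so the Davies-Bouldin index is (1 + 1) / 10 = 0.2; the closest pair across clusters is 8 apart and the largest diameter is 2, so the Dunn index is 4.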

Claims (1)

  1. A mass data clustering processing method based on a minimum spanning tree, characterized in that the method comprises the following steps:
    (1) converting the mass data U to be processed into a node matrix A;
    letting dist(i, j) denote the distance between any two data items i and j in the mass data U, and assigning these distances as the entries of the matrix A, wherein the node matrix A corresponds to a fully connected graph whose edge weights are the distances dist(i, j) between pairs of data items, and wherein, if the number of data items to be processed is m, the node matrix A is:
    $$A = \begin{bmatrix} \mathrm{dist}(1,1) & \cdots & \mathrm{dist}(1,m) \\ \vdots & \ddots & \vdots \\ \mathrm{dist}(m,1) & \cdots & \mathrm{dist}(m,m) \end{bmatrix}$$
    (2) applying Prim's algorithm to the node matrix A to obtain a sparse node matrix $A_m$ of minimum edge weights:
    $$A_m = L_m + U_m$$
    wherein the sparse node matrix $A_m$ corresponds to a minimum spanning tree, $L_m$ is the lower half of $A_m$, and $U_m$ is the upper half of $A_m$;
    (3) for each node i of the minimum spanning tree, counting in the i-th row and i-th column of the matrix $L_m$ of step (2) the number D(U(i)) of edges connected to node i, and recording D(U(i)) as the degree of the corresponding node in the node matrix A;
    (4) using the degrees D(U(i)), computing the weight-difference measure θ of the edges connected to each node whose degree D(U(i)) is greater than 2:
    $$\theta(U(i)) = \max\left(\frac{\mathrm{dist}(i,j)^2}{\mathrm{dist}(i,k)^2}\right), \quad 1 \le j, k \le D(U(i))$$
    wherein j and k are nodes connected to node i in the minimum spanning tree of step (2);
    (5) setting the number of clusters n for the clustering process, sorting the nodes by the size of the weight-difference measure θ to obtain a node sequence, and, for each of the first n-1 nodes of the sequence, deleting the edge of largest weight from the minimum spanning tree of step (2), thereby obtaining n mutually disconnected trees, wherein the nodes of each tree form one data cluster, giving n data clusters and completing the mass data clustering processing based on the minimum spanning tree.
CN201710467400.7A 2017-06-20 2017-06-20 A kind of mass data clustering processing method based on minimum spanning tree Pending CN107506778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710467400.7A CN107506778A (en) 2017-06-20 2017-06-20 A kind of mass data clustering processing method based on minimum spanning tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710467400.7A CN107506778A (en) 2017-06-20 2017-06-20 A kind of mass data clustering processing method based on minimum spanning tree

Publications (1)

Publication Number Publication Date
CN107506778A true CN107506778A (en) 2017-12-22

Family

ID=60678437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710467400.7A Pending CN107506778A (en) 2017-06-20 2017-06-20 A kind of mass data clustering processing method based on minimum spanning tree

Country Status (1)

Country Link
CN (1) CN107506778A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116060A (en) * 2022-08-25 2022-09-27 深圳前海环融联易信息科技服务有限公司 Key value file processing method, device, equipment, medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20171222