CN107506778A - A kind of mass data clustering processing method based on minimum spanning tree - Google Patents
- Publication number
- CN107506778A (application number CN201710467400.7A)
- Authority
- CN
- China
- Prior art keywords
- node
- mass data
- spanning tree
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a mass data clustering processing method based on a minimum spanning tree, and belongs to the technical field of classification and data mining algorithms. The method builds a fully connected tree over the mass data using Prim's minimum spanning tree algorithm. A suitable distance metric is chosen according to the physical meaning of the data, and the edge weights of the minimum spanning tree are determined from that metric. A node incidence matrix is generated from the structure of the minimum spanning tree, and the redundant entries in it are deleted by exploiting its symmetry. Using the incidence matrix, a weight difference measure is computed for nodes of different degree, and the nodes are sorted by its magnitude. The longer edges at the nodes with the highest weight difference are then removed, taking the actual physical meaning of the data into account, to obtain the desired number of sample clusters. The invention can cluster mass data and reduce the difficulty of subsequent data analysis.
Description
Technical field
The invention relates to a mass data clustering processing method based on a minimum spanning tree, and belongs to the technical field of classification and data mining algorithms.
Background
With the progress of computer science, more and more data analysis tasks involve sample sets so large that the sample points are difficult to describe with a single unified distribution, so the data must be clustered first. Clustering groups a series of related data objects into several data sets with stronger internal correlation, so that the objects within each set are more closely related to one another.
Commonly used clustering methods include k-means clustering, hierarchical clustering, and fuzzy clustering. Most of these methods depend on the choice of initial state and decide cluster membership entirely from the distance measured between sample points, so they perform poorly on samples that carry a specific physical meaning.
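The dependence on the initial state can be illustrated with a minimal one-dimensional k-means sketch. The function name `kmeans_1d`, the sample values, and the two sets of starting centers are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def kmeans_1d(points, centers, iters=100):
    """Plain Lloyd iteration on 1-D data; returns the final groups."""
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest current center
            idx = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[idx].append(p)
        # recompute centers; keep the old center if a group is empty
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return groups

points = [0, 1, 9, 10, 20, 21]          # hypothetical samples
result_a = kmeans_1d(points, [0, 1])    # first choice of initial centers
result_b = kmeans_1d(points, [10, 20])  # second choice of initial centers
```

With these values the two runs settle on different partitions, `[[0, 1], [9, 10, 20, 21]]` and `[[0, 1, 9, 10], [20, 21]]`, even though the data are identical.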
However, in machine learning and related applications, models are often trained on large-scale data that carry physical meaning. If the data are not clustered first, training places high demands on hardware memory and computing speed. In addition, the results of common clustering methods are difficult to reconcile with physical meaning, so they are often unsatisfactory. This introduces large errors into subsequent data analysis and model training and can seriously set back related research. To address this problem, traditional clustering methods need to be improved so that sample data with strong physical meaning are handled in a new way and yield better results.
Adding manual decision assistance to the clustering algorithm is one feasible way to avoid these errors. Conventional clustering methods follow a single fixed procedure with long and complex calculations, which makes it hard to incorporate human judgment. This patent therefore uses the minimum spanning tree algorithm to design an improved sample point clustering method. The minimum spanning tree algorithm is one of the common algorithms in planning applications. Computing the minimum spanning tree of a set of nodes yields a design with optimal construction cost or other performance in engineering applications, and also establishes a tree structure over the data points. Because this structure is simple and clear, it is easy for a decision maker to analyze, and it is therefore well suited to improving the performance of clustering methods.
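As a rough illustration of the minimum spanning tree computation described above, the following is a minimal sketch of Prim's algorithm on a dense symmetric weight matrix. The function name `prim_mst` and the four-site cost matrix are hypothetical, not taken from the patent.

```python
import math

def prim_mst(weights, start=0):
    """Return MST edges (u, v, w) of a complete graph given as a symmetric matrix."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight linking each node to the tree
    parent = [-1] * n       # tree endpoint of that cheapest edge
    best[start] = 0.0
    edges = []
    for _ in range(n):
        # pick the node outside the tree that is cheapest to attach
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, weights[parent[u]][u]))
        # update the attachment cost of the nodes still outside the tree
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges

# hypothetical construction costs between four sites
costs = [
    [0, 2, 6, 5],
    [2, 0, 3, 8],
    [6, 3, 0, 4],
    [5, 8, 4, 0],
]
mst = prim_mst(costs)
total_cost = sum(w for _, _, w in mst)  # 2 + 3 + 4 = 9
```

For this matrix the tree consists of the edges 0-1, 1-2, and 2-3. Starting from a different node gives the same edge set here because all edge weights are distinct.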
In electric power research, weather condition information is commonly used to forecast the output of various distributed power sources. However, because there are many types of weather conditions and the data are numerous and complex, this information cannot be applied directly in practical calculations.
Summary of the invention
The purpose of the invention is to propose a mass data clustering processing method based on a minimum spanning tree, which clusters mass data by means of Prim's algorithm and human-assisted decision making, thereby supporting subsequent data analysis work.
The mass data clustering processing method based on a minimum spanning tree proposed by the invention comprises the following steps:
(1) Convert the pending mass data U into a node matrix A:
Let the distance between any two data items in the pending mass data U be dist(·), and assign this distance to the entries of matrix A. Node matrix A corresponds to a fully connected graph whose edge weights are dist(·), that is, dist(·) is the edge weight between any two data items. If the number of pending data items is m, node matrix A is as follows:
$$
A = \begin{pmatrix}
\operatorname{dist}(1,1) & \operatorname{dist}(1,2) & \cdots & \operatorname{dist}(1,m) \\
\operatorname{dist}(2,1) & \operatorname{dist}(2,2) & \cdots & \operatorname{dist}(2,m) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{dist}(m,1) & \operatorname{dist}(m,2) & \cdots & \operatorname{dist}(m,m)
\end{pmatrix}
$$
(2) Process node matrix A with Prim's algorithm to obtain a minimum-edge-weight sparse node matrix $A_m$:
$$A_m = L_m + U_m$$
The sparse node matrix $A_m$ corresponds to a minimum spanning tree, where $L_m$ is the lower half of $A_m$ and $U_m$ is the upper half of $A_m$;
(3) For each node i, count in the i-th row and i-th column of matrix $L_m$ from step (2) the number D(U) of edges connected to node i in the minimum spanning tree, and record D(U) as the degree of the corresponding node in node matrix A;
(4) Based on the degree D(U), use the following formula to calculate the weight difference measure θ of the edges connected to each node whose D(U) is greater than 2:
$$\theta(U(i)) = \max\left(\frac{\operatorname{dist}(i,j)^2}{\operatorname{dist}(i,k)^2}\right), \quad 1 \le j, k \le D(U(i))$$
where j and k are nodes connected to node i in the minimum spanning tree of step (2);
(5) Set the number of clusters n for the mass data clustering. Sort the corresponding nodes in descending order of the weight difference measure θ to obtain a node sequence. For each of the first n-1 nodes in the sequence, delete its edge with the largest weight from the minimum spanning tree of step (2), obtaining n mutually disconnected trees. The nodes in each tree form one data cluster, which gives n data clusters and completes the mass data clustering processing based on the minimum spanning tree.
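Steps (3) to (5) above can be sketched as follows, starting from a minimum spanning tree given as an edge list. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `cluster_by_theta` and the seven-node tree are hypothetical, all edge weights are assumed to be positive, and the maximum of the ratio dist(i, j)² / dist(i, k)² over the edges at node i is computed as (longest incident edge / shortest incident edge)².

```python
from collections import defaultdict

def cluster_by_theta(n_nodes, mst_edges, n_clusters):
    """Cut an MST into clusters using the ratio-based weight difference measure."""
    adj = defaultdict(dict)
    for u, v, w in mst_edges:
        adj[u][v] = w
        adj[v][u] = w

    # Step (4): theta for nodes of degree > 2 (degree = number of incident tree edges)
    theta = {}
    for i in range(n_nodes):
        ws = list(adj[i].values())
        if len(ws) > 2:
            theta[i] = (max(ws) / min(ws)) ** 2

    # Step (5): rank nodes by theta, then drop the heaviest edge of the top n-1 nodes
    ranked = sorted(theta, key=theta.get, reverse=True)
    removed = set()
    for i in ranked[: n_clusters - 1]:
        j = max(adj[i], key=adj[i].get)
        removed.add(frozenset((i, j)))

    # Label the connected components that remain after the cuts
    labels = [-1] * n_nodes
    current = 0
    for s in range(n_nodes):
        if labels[s] != -1:
            continue
        labels[s] = current
        stack = [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if frozenset((u, v)) not in removed and labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels, theta

# hypothetical spanning tree over 7 nodes: (node, node, edge weight)
tree = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 4.0), (3, 4, 1.0), (4, 5, 1.0), (4, 6, 2.0)]
labels2, theta = cluster_by_theta(7, tree, n_clusters=2)
labels3, _ = cluster_by_theta(7, tree, n_clusters=3)
```

In this toy tree, node 0 has θ = 16 and node 4 has θ = 4, so asking for two clusters cuts edge 0-3 and asking for three clusters also cuts edge 4-6. If two selected nodes share the same heaviest edge, this sketch would produce fewer than n clusters; the text above does not say how that case is handled.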
The features of the mass data clustering processing method based on a minimum spanning tree proposed by the invention are as follows:
Using Prim's algorithm for computing the minimum spanning tree, the invention proposes a clustering processing method for mass data that is suited to multidimensional sample data with strong physical meaning. Because a minimum spanning tree algorithm is used, the method can take the distance metric used in clustering as the tree edge weight during calculation, and thereby generate the minimum spanning tree in which the overall system is most tightly connected. On this basis, the clear and simple tree structure of the sample points makes it easy for a decision maker to apply appropriate manual corrections, finally yielding the desired sample clusters.
The invention has the following advantages:
1. The invention uses Prim's algorithm to build the minimum spanning tree. Compared with traditional clustering methods, the clustering result of this method does not depend on the choice of initial point, and the data points within each sample cluster are more closely connected, so a more stable and better clustering result can be obtained in practical use.
2. The invention adds manual decision assistance on top of the clustering method. The clustering of sample points with strong physical meaning can be corrected by the decision result, which reduces the losses that clustering errors cause in subsequent data analysis and improves the accuracy of model training.
Brief description of the drawings
Fig. 1(a) is a schematic diagram of the cluster distribution obtained by directly removing the longest edges of the minimum spanning tree.
Fig. 1(b) is a schematic diagram of the cluster distribution obtained by the method of the invention.
Fig. 1(c) is a schematic diagram of the cluster distribution obtained by a conventional clustering method.
Fig. 2 compares the clustering performance of the method of the invention with that of common methods.
Embodiment
The mass data clustering processing method based on a minimum spanning tree proposed by the invention can be used to cluster multidimensional data sample points with strong physical meaning. The method comprises the following steps:
(1) In this method, the minimum spanning tree of the weather condition sample points is determined by Prim's algorithm and its associated calculation rules: the minimum spanning tree of the sample points is established, and multiple sample clusters are generated by calculating the node weight difference. Convert the pending mass data U into a node matrix A:
Let the distance between any two data items in the pending mass data U be dist(·), and assign this distance to the entries of matrix A. Node matrix A corresponds to a fully connected graph whose edge weights are dist(·), that is, dist(·) is the edge weight between any two data items. If the number of pending data items is m, node matrix A is as follows:
$$
A = \begin{pmatrix}
\operatorname{dist}(1,1) & \operatorname{dist}(1,2) & \cdots & \operatorname{dist}(1,m) \\
\operatorname{dist}(2,1) & \operatorname{dist}(2,2) & \cdots & \operatorname{dist}(2,m) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{dist}(m,1) & \operatorname{dist}(m,2) & \cdots & \operatorname{dist}(m,m)
\end{pmatrix}
$$
(2) Process node matrix A with Prim's algorithm to obtain a minimum-edge-weight sparse node matrix $A_m$:
$$A_m = L_m + U_m$$
The sparse node matrix $A_m$ corresponds to a minimum spanning tree, where $L_m$ is the lower half of $A_m$ and $U_m$ is the upper half of $A_m$. Because the minimum-edge-weight sparse node matrix $A_m$ is symmetric, only its lower half $L_m$ is analyzed;
(3) For each node i, count in the i-th row and i-th column of matrix $L_m$ from step (2) the number D(U) of edges connected to node i in the minimum spanning tree, and record D(U) as the degree of the corresponding node in node matrix A;
(4) Based on the degree D(U), use the following formula to calculate the weight difference measure θ of the edges connected to each node whose D(U) is greater than 2:
$$\theta(U(i)) = \max\left(\frac{\operatorname{dist}(i,j)^2}{\operatorname{dist}(i,k)^2}\right), \quad 1 \le j, k \le D(U(i))$$
where j and k are nodes connected to node i in the minimum spanning tree of step (2);
(5) Set the number of clusters n for the mass data clustering. Sort the corresponding nodes in descending order of the weight difference measure θ to obtain a node sequence. For each of the first n-1 nodes in the sequence, delete its edge with the largest weight from the minimum spanning tree of step (2), obtaining n mutually disconnected trees. The nodes in each tree form one data cluster, which gives n data clusters and completes the mass data clustering processing based on the minimum spanning tree. Fig. 1(b) shows the sample point cluster distribution obtained by the method of the invention, Fig. 1(a) shows the cluster distribution obtained by removing the longest edges after the sample point minimum spanning tree is generated, and Fig. 1(c) shows the cluster distribution obtained by the k-means method.
The clustering performance comparison is shown in Fig. 2. The method of the invention performs better on the Davies-Bouldin index (DBI), the Dunn index (DI), and the weight index (WI), and can therefore be widely applied in practical data processing.
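For reference, the Davies-Bouldin and Dunn indices named above can be computed roughly as in the sketch below. It uses common textbook definitions, which may differ in detail from the variants behind Fig. 2; the weight index is not defined in the text and is not attempted. The two toy clusters are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def davies_bouldin(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance; lower is better."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, cents[i]) for p in c) / len(c) for i, c in enumerate(clusters)]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k

def dunn(clusters):
    """Smallest between-cluster distance over largest within-cluster diameter; higher is better."""
    min_between = min(
        math.dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b
    )
    max_diameter = max(math.dist(p, q) for c in clusters for p, q in combinations(c, 2))
    return min_between / max_diameter

# two hypothetical, well-separated clusters
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
dbi = davies_bouldin(clusters)  # (1 + 1) / 10 = 0.2
di = dunn(clusters)             # 10 / 2 = 5.0
```

The sketch assumes at least two clusters and at least one cluster with two or more points; otherwise the diameter term is undefined.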
Claims (1)
- 1. A mass data clustering processing method based on a minimum spanning tree, characterized in that the method comprises the following steps:
  (1) Convert the pending mass data U into a node matrix A: let the distance between any two data items in the pending mass data U be dist(·), and assign this distance to the entries of matrix A. Node matrix A corresponds to a fully connected graph whose edge weights are dist(·), that is, dist(·) is the edge weight between any two data items. If the number of pending data items is m, node matrix A is as follows:
  $$
  A = \begin{pmatrix}
  \operatorname{dist}(1,1) & \operatorname{dist}(1,2) & \cdots & \operatorname{dist}(1,m) \\
  \operatorname{dist}(2,1) & \operatorname{dist}(2,2) & \cdots & \operatorname{dist}(2,m) \\
  \vdots & \vdots & \ddots & \vdots \\
  \operatorname{dist}(m,1) & \operatorname{dist}(m,2) & \cdots & \operatorname{dist}(m,m)
  \end{pmatrix}
  $$
  (2) Process node matrix A with Prim's algorithm to obtain a minimum-edge-weight sparse node matrix $A_m$:
  $$A_m = L_m + U_m$$
  The sparse node matrix $A_m$ corresponds to a minimum spanning tree, where $L_m$ is the lower half of $A_m$ and $U_m$ is the upper half of $A_m$;
  (3) For each node i, count in the i-th row and i-th column of matrix $L_m$ from step (2) the number D(U) of edges connected to node i in the minimum spanning tree, and record D(U) as the degree of the corresponding node in node matrix A;
  (4) Based on the degree D(U), use the following formula to calculate the weight difference measure θ of the edges connected to each node whose D(U) is greater than 2:
  $$\theta(U(i)) = \max\left(\frac{\operatorname{dist}(i,j)^2}{\operatorname{dist}(i,k)^2}\right), \quad 1 \le j, k \le D(U(i))$$
  where j and k are nodes connected to node i in the minimum spanning tree of step (2);
  (5) Set the number of clusters n for the mass data clustering. Sort the corresponding nodes by the size of the weight difference measure θ to obtain a node sequence. For each of the first n-1 nodes in the sequence, delete its edge with the largest weight from the minimum spanning tree of step (2), obtaining n mutually disconnected trees. The nodes in each tree form one data cluster, which gives n data clusters and completes the mass data clustering processing based on the minimum spanning tree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710467400.7A CN107506778A (en) | 2017-06-20 | 2017-06-20 | A kind of mass data clustering processing method based on minimum spanning tree |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710467400.7A CN107506778A (en) | 2017-06-20 | 2017-06-20 | A kind of mass data clustering processing method based on minimum spanning tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107506778A (en) | 2017-12-22 |
Family
ID=60678437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710467400.7A Pending CN107506778A (en) | 2017-06-20 | 2017-06-20 | A kind of mass data clustering processing method based on minimum spanning tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107506778A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115116060A (en) * | 2022-08-25 | 2022-09-27 | 深圳前海环融联易信息科技服务有限公司 | Key value file processing method, device, equipment, medium and computer program product |
- 2017-06-20 CN CN201710467400.7A patent/CN107506778A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102521656B (en) | Integrated transfer learning method for classification of unbalance samples | |
CN109871995B (en) | Quantum optimization parameter adjusting method for distributed deep learning under Spark framework | |
CN111242282B (en) | Deep learning model training acceleration method based on end edge cloud cooperation | |
CN104573879A (en) | Photovoltaic power station output predicting method based on optimal similar day set | |
CN107705556A (en) | A kind of traffic flow forecasting method combined based on SVMs and BP neural network | |
CN107480696A (en) | A kind of disaggregated model construction method, device and terminal device | |
CN108765180A (en) | The overlapping community discovery method extended with seed based on influence power | |
CN110765582B (en) | Self-organization center K-means microgrid scene division method based on Markov chain | |
CN108182316B (en) | Electromagnetic simulation method based on artificial intelligence and electromagnetic brain thereof | |
CN107194818A (en) | Label based on pitch point importance propagates community discovery algorithm | |
CN104050547A (en) | Non-linear optimization decision-making method of planning schemes for oilfield development | |
CN110059765B (en) | Intelligent mineral identification and classification system and method | |
CN110096630A (en) | Big data processing method of the one kind based on clustering | |
CN114021483A (en) | Ultra-short-term wind power prediction method based on time domain characteristics and XGboost | |
CN116484495A (en) | Pneumatic data fusion modeling method based on test design | |
CN105373846A (en) | Oil gas gathering and transferring pipe network topological structure intelligent optimization method based on grading strategy | |
CN109977977A (en) | A kind of method and corresponding intrument identifying potential user | |
CN107506778A (en) | A kind of mass data clustering processing method based on minimum spanning tree | |
CN110956010B (en) | Large-scale new energy access power grid stability identification method based on gradient lifting tree | |
CN107133348A (en) | Extensive picture concentrates the proximity search method based on semantic consistency | |
CN110276478B (en) | Short-term wind power prediction method based on segmented ant colony algorithm optimization SVM | |
CN104462853A (en) | Population elite distribution cloud collaboration equilibrium method used for feature extraction of electronic medical record | |
CN110544124A (en) | waste mobile phone pricing method based on fuzzy neural network | |
CN113722951B (en) | Scatterer three-dimensional finite element grid optimization method based on neural network | |
CN106203469A (en) | A kind of figure sorting technique based on orderly pattern |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20171222 |