CN108985318A - A global optimization K-means clustering method and system based on sample density - Google Patents

Info

Publication number
CN108985318A
CN108985318A CN201810525709.1A CN201810525709A CN108985318A CN 108985318 A CN108985318 A CN 108985318A CN 201810525709 A CN201810525709 A CN 201810525709A CN 108985318 A CN108985318 A CN 108985318A
Authority
CN
China
Prior art keywords
density value
sub-cluster
cluster
sample point
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810525709.1A
Other languages
Chinese (zh)
Inventor
许鸿文
薛印玺
陈雯
李羚
殷蔚明
谢靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN201810525709.1A
Publication of CN108985318A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

To address the problem that the clustering results of the traditional K-means clustering method depend on the initial cluster centers and easily fall into local optima, the present invention proposes a global optimization K-means clustering method and system based on sample density (KMS-GOSD). During iteration, the KMS-GOSD method first obtains the pre-estimated density value of every cluster center from a Gaussian model, and then applies an offset operation to the cluster center whose actual density value falls furthest below the pre-estimated density value. By optimizing the positions of the cluster centers, the KMS-GOSD method both improves global exploration ability and overcomes the dependence on the initial cluster centers. Comparative experiments on standard UCI data sets show that the improved method achieves higher accuracy and stability than the traditional method.

Description

A global optimization K-means clustering method and system based on sample density
Technical Field
The present invention relates to the field of fast density-peak clustering in machine learning, and more particularly to a global optimization K-means clustering method and system based on sample density.
Background
The traditional K-means clustering method is simple and effective, converges quickly, and handles large data sets conveniently, so it is now widely used in scientific research, industrial applications, and many other fields. However, its weak global exploration ability on complex data sets and its tendency to fall into local optima remain key difficulties in research on improving K-means.
Drawing on the core idea of density-peak clustering, scholars in China and abroad have analyzed and improved the K-means method from different angles. In China, Xing Changzheng et al. proposed a method that optimizes the initial cluster centers based on average density: the attributes and structure of the data objects are analyzed before clustering to choose suitable initial cluster centers in place of the random initial centers of traditional K-means, while the iterative process of traditional K-means is kept unchanged. Scholars abroad have proposed using kernel functions, adaptive neural networks, differential evolution, and similar methods during iteration to help K-means find high-density sample points over the global range. For example, the paper "Approximate Normalized Cuts without Eigen-decomposition" (Information Sciences) proposes optimizing the objective function with an approximate weighted kernel.
Because traditional K-means is sensitive to the cluster centers, the choice of cluster centers directly affects clustering accuracy. The papers "Optimization of the initial cluster centers of the K-means algorithm" and "A distributed clustering mining algorithm based on local density" point out that a cluster center should lie at a point of relatively high sample density within its cluster. The papers "An improved k-means initial cluster center selection algorithm", "A new k-means cluster center selection algorithm", and "K-means algorithm with minimum-variance optimized cluster centers" show, through theoretical analysis and experimental results, that clustering accuracy improves markedly when the cluster centers are located where the sample density is higher.
Summary of the Invention
To address the problem that the clustering results of the traditional K-means clustering method depend on the initial cluster centers and easily fall into local optima, and to avoid excessive analysis of scattered data objects while speeding up global exploration, the present invention proposes a global optimization K-means clustering method based on sample density (Global Optimized K-means Clustering Algorithm based on Sample Density, abbreviated KMS-GOSD). During the iterations of traditional K-means clustering, the KMS-GOSD method takes the cluster center whose actual density value falls furthest below the pre-estimated density value and shifts it to a point in the same cluster whose density exceeds the pre-estimated density value. This avoids falling into local optima and thereby overcomes the dependence of the clustering result on the initial cluster centers. Before the offset, a decay factor inversely proportional to the number of iterations is applied so that the pre-estimated density value gradually decreases, which in turn lowers the probability that a cluster center is shifted. This gives the KMS-GOSD method strong global exploration ability in the early iterations and strong stability in the later ones.
A global optimization K-means clustering method based on sample density comprises the following steps:
S1. Obtain an original data set X containing N sample points, the number of sub-clusters K, and a scale parameter Ra, where N is greater than 1;
S2. Randomly select K sample points from the original data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, …, K;
S3. Compute the distance from every sample point in the original data set other than the initial cluster centers to each initial cluster center w_i, and assign each such sample point to its nearest initial cluster center, forming K sub-clusters;
S4. Denote the centroid of each sub-cluster as W_i, compute the pre-estimated density value F_{i,t} of W_i according to the formula [formula not reproduced in the source text], and compute the actual density value F_{i,c} of W_i;
where m is the maximum number of iterations and is a preset value, t is the current iteration number, and the value of 2φ(3σ × Ra) can be looked up in a table of the standard normal distribution function;
S5. Take the centroid W_i of each sub-cluster as a new cluster center, and determine whether the sub-clusters of the new cluster centers contain a sample point whose actual density value F_{i,c} is less than the pre-estimated density value F_{i,t}; if not, go to S10; if so, go to S6;
S6. Obtain the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t};
S7. In the sub-cluster obtained in S6, randomly select several sample points and compute the actual density value F_{i,c} of each of them;
S8. Determine whether any of these sample points has an actual density value F_{i,c} greater than the pre-estimated density value F_{i,t}; if so, go to S10; otherwise go to S9;
S9. Take the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t} as the new cluster center, then execute step S10;
S10. Determine whether the cluster centers W_i no longer change; if so, go to S11; otherwise update the iteration number t to t + 1, take the new cluster centers as the new initial cluster centers, and go to S3;
S11. Output the clustering result.
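The assignment and centroid parts of steps S2 to S4 can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the function names and the toy data are hypothetical, and for simplicity every point (including the chosen centers themselves) is passed through the assignment, whereas S3 speaks only of the points other than the initial centers.

```python
import math
import random

def nearest_center(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def form_subclusters(points, centers):
    """Step S3: assign every point to its nearest center, giving K sub-clusters."""
    clusters = [[] for _ in centers]
    for p in points:
        clusters[nearest_center(p, centers)].append(p)
    return clusters

def centroid(cluster):
    """Centroid W_i used in step S4: coordinate-wise mean of a non-empty sub-cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Step S2 on a toy data set (hypothetical): pick K = 2 random sample points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_centers = random.Random(0).sample(points, 2)
clusters = form_subclusters(points, initial_centers)
new_centers = [centroid(c) for c in clusters]
```

Because each initial center is itself a sample point, every sub-cluster in this sketch contains at least one point as long as the data has no duplicate points, so `centroid` is never called on an empty list here.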
In the global optimization K-means clustering method based on sample density of the present invention, the actual density value F_{i,c} is computed according to the formula [formula not reproduced in the source text], where [expression not reproduced in the source text], d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j denotes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is the maximum distance from the cluster center of any sub-cluster to the sample points in that sub-cluster.
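Since the formula for F_{i,c} is not reproduced in this text, the following is only a sketch under a stated assumption: it treats the actual density as the fraction of the sub-cluster's S_i points whose distance d_ij to the center is at most r = R × Ra. That counting rule is a guess that makes the value comparable with a Gaussian probability; it is not confirmed by the source, and the function name is hypothetical.

```python
import math

def actual_density(center, cluster, ra):
    """Fraction of the sub-cluster's points lying within r = R * Ra of `center`.

    ASSUMPTION: the patent's formula for F_{i,c} is missing from the text, so
    this counting rule is an illustrative stand-in, not the patented formula.
    """
    distances = [math.dist(center, p) for p in cluster]  # d_ij
    big_r = max(distances)                               # R: farthest point in the sub-cluster
    r = big_r * ra                                       # r = R x Ra
    inside = sum(1 for d in distances if d <= r)
    return inside / len(cluster)                         # S_i = len(cluster)

# Toy sub-cluster (hypothetical data) with its center at the origin.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
density = actual_density((0.0, 0.0), cluster, ra=0.5)
```

With Ra = 1 the radius r equals R and every point is counted, so under this assumption the value is 1; smaller values of Ra restrict the count to the neighborhood of the center.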
The present invention also provides a global optimization K-means clustering system based on sample density, comprising the following modules:
an initialization module, for obtaining an original data set X containing N sample points, the number of sub-clusters K, and a scale parameter Ra, where N is greater than 1;
an initial cluster center acquisition module, for randomly selecting K sample points from the original data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, …, K;
a sub-cluster formation module, for computing the distance from every sample point in the original data set other than the initial cluster centers to each initial cluster center w_i, and assigning each such sample point to its nearest initial cluster center w_i, forming K sub-clusters;
a pre-estimated density value and actual density value computation module, for denoting the centroid of each sub-cluster as W_i, computing the pre-estimated density value F_{i,t} of W_i by the formula [formula not reproduced in the source text], and computing the actual density value F_{i,c} of W_i;
where m is the maximum number of iterations and is a preset value, t is the current iteration number, and the value of 2φ(3σ × Ra) can be looked up in a table of the standard normal distribution function;
a first judgment module, for taking the centroid W_i of each sub-cluster as a new cluster center and determining whether the sub-clusters of the new cluster centers contain a sample point whose actual density value F_{i,c} is less than the pre-estimated density value F_{i,t}; if not, going to the third judgment module; if so, going to the sub-cluster acquisition module;
a sub-cluster acquisition module, for obtaining the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t};
an actual density value computation module, for randomly selecting several sample points in the sub-cluster obtained by the sub-cluster acquisition module and computing the actual density value F_{i,c} of each of them;
a second judgment module, for determining whether any of these sample points has an actual density value F_{i,c} greater than the pre-estimated density value F_{i,t}; if so, going to the third judgment module; otherwise going to the new cluster center acquisition module;
a new cluster center acquisition module, for taking the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t} as the new cluster center, and then executing the third judgment module;
a third judgment module, for determining whether the cluster centers W_i no longer change; if so, going to the result output module; otherwise updating the iteration number t to t + 1, taking the new cluster centers as the new initial cluster centers, and going to the sub-cluster formation module;
a result output module, for outputting the clustering result.
In the global optimization K-means clustering system based on sample density of the present invention, the actual density value F_{i,c} in the pre-estimated density value and actual density value computation module is computed according to the formula [formula not reproduced in the source text], where [expression not reproduced in the source text], d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j denotes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is the maximum distance from the cluster center of any sub-cluster to the sample points in that sub-cluster.
Compared with the traditional K-means clustering method, the KMS-GOSD method first obtains, during iteration, the pre-estimated density value of every cluster center from a Gaussian model, and then applies an offset operation to the cluster center whose actual density value falls furthest below the pre-estimated density value. By optimizing the positions of the cluster centers, the KMS-GOSD method both improves global exploration ability and overcomes the dependence on the initial cluster centers. Comparative experiments on standard UCI data sets show that the improved method achieves higher accuracy and stability than the traditional method.
Brief Description of the Drawings
The present invention is further described below with reference to the drawings and embodiments. In the drawings:
Fig. 1 is a flow chart of an embodiment of the present invention;
Fig. 2 shows the Gaussian distribution model corresponding to the density of the sample points in a sub-cluster.
Detailed Description of the Embodiments
For a clearer understanding of the technical features, objects, and effects of the invention, specific embodiments of the invention are now described in detail with reference to the drawings.
Because traditional K-means is sensitive to the cluster centers, the choice of cluster centers directly affects clustering accuracy. The papers "Optimization of the initial cluster centers of the K-means algorithm" and "A distributed clustering mining algorithm based on local density" point out that a cluster center should lie at a point of relatively high sample density within its cluster. The papers "An improved k-means initial cluster center selection algorithm", "A new k-means cluster center selection algorithm", and "K-means algorithm with minimum-variance optimized cluster centers" show, through theoretical analysis and experimental results, that clustering accuracy improves markedly when the cluster centers are located where the sample density is higher. The core idea of the global optimization K-means clustering algorithm proposed by the present invention is to search dynamically for higher-density cluster centers during iteration, rather than choosing higher-density initial cluster centers directly before iteration. To judge whether the density of a cluster center is high or low, the invention first obtains the initial pre-estimated density value for the iteration from a Gaussian model, then computes the actual density value of each cluster center using Euclidean distances, and finally compares the actual density value with the pre-estimated value, which is gradually reduced over the iterations, to obtain cluster centers of higher density.
The paper "A distributed clustering mining algorithm based on local density" proposes that a sample set is a mixture of ideal clusters, and that within each ideal cluster the local density of the sample points varies monotonically from the interior of the cluster to its edge. On this basis, the density distribution of the sample points in a sub-cluster is estimated with a Gaussian distribution model. Suppose that in a data set X of N sample points the number of sub-clusters is K, the initial cluster centers are w_i (i = 1, 2, 3, …, K), the cluster centers during iteration are W_i, and the sub-cluster radius R is the distance from W_i to the farthest sample point in its cluster. To make it convenient to compute the estimated density of a sub-cluster, a circle centered at w_i with radius r = R × Ra is chosen as the sample region for the statistics.
As shown in Fig. 2, under the action of the decay factor Ra, x_1 = 3σ × Ra. The density of W_i within the circle of radius r is estimated by computing the distribution function over the range [−x_1, x_1]. The pre-estimated density value is computed as:
F_{i,1} = 2φ(3σ × Ra) − 1    (1)
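The table lookup for the standard normal distribution can be replaced by a direct computation. The sketch below rests on two assumptions that the text does not confirm: that the estimate is the probability mass of a normal distribution over [−x_1, x_1] with x_1 = 3σ × Ra, and that the lookup uses the standardized argument x_1 / σ = 3 × Ra. The function names are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_mass(ra, n_sigma=3.0):
    """Probability mass of N(0, sigma^2) on [-x1, x1] with x1 = n_sigma * sigma * ra.

    ASSUMPTION: standardizing x1 by sigma gives z = n_sigma * ra, so sigma
    cancels; whether the patent's F_{i,1} equals this quantity is an
    interpretation of the garbled formula (1).
    """
    z = n_sigma * ra
    return 2.0 * std_normal_cdf(z) - 1.0

# Ra = 1 covers the full three-sigma range; smaller Ra covers less mass.
full_range = interval_mass(1.0)
one_sigma = interval_mass(1.0 / 3.0)
```

Under these assumptions a smaller Ra gives a smaller estimated density, which is consistent with Ra acting as a scale on the statistical region r = R × Ra.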
The actual density value of the cluster center is: [formula not reproduced in the source text]
where d_ij is the Euclidean distance from W_i to the sample point n_ij, S_i is the number of sample points in the i-th sub-cluster, c ∈ [1, c_max], and c_max is the maximum number of offsets.
In each iteration, traditional K-means takes the centroid of each new sub-cluster as the new cluster center W_i. When the centroid serves as the cluster center, it is easily trapped at a point that minimizes the local sum of distances and thus falls into a local optimum; the centroid is not the global optimum. To avoid this, the KMS-GOSD algorithm adds an offset operation to the iterative process, by density [remainder of the sentence and its formula not reproduced in the source text]
where m is the maximum number of iterations and t is the current iteration number. Adding the decay factor gradually lowers the pre-estimated density value and accelerates the convergence of the offset process. To reduce the complexity of the KMS-GOSD algorithm, the offset is applied only to the cluster center of lowest density among the W_i; the other cluster centers remain unchanged. The KMS-GOSD algorithm therefore has strong exploration ability in the early iterations, and in the later iterations gradually becomes consistent with traditional K-means, with strong exploitation ability and convergence performance.
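The rule that only one cluster center is shifted per iteration can be sketched as a small selection function. This is an illustration under assumptions: the densities are taken as already computed lists, the function name is hypothetical, and "lowest density" is read as "largest shortfall of actual density below the pre-estimated density", which matches the wording of the abstract.

```python
def center_to_offset(actual, estimated):
    """Return the index of the cluster center to shift, or None if none qualifies.

    actual[i]    -- actual density value F_{i,c} of center W_i
    estimated[i] -- pre-estimated density value F_{i,t} of center W_i
    Only centers whose actual density is below the estimate are candidates;
    among them the one with the largest shortfall is chosen.
    """
    shortfalls = [
        (est - act, i)
        for i, (act, est) in enumerate(zip(actual, estimated))
        if act < est
    ]
    if not shortfalls:
        return None  # no center is below its estimate: skip the offset
    return max(shortfalls)[1]

# Hypothetical densities for K = 3 centers sharing one estimate of 0.8.
chosen = center_to_offset([0.9, 0.5, 0.7], [0.8, 0.8, 0.8])
```

As the pre-estimated value decays over the iterations, fewer centers fall below it, so the function returns `None` more often and the loop behaves like plain K-means.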
A global optimization K-means clustering method based on sample density, whose detailed flow is shown in Fig. 1, comprises the following steps:
S1. Obtain an original data set X containing N sample points, the number of sub-clusters K, and a scale parameter Ra, where N is greater than 1;
S2. Randomly select K sample points from the original data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, …, K;
S3. Compute the distance from every sample point in the original data set other than the initial cluster centers to each initial cluster center w_i, and assign each such sample point to its nearest initial cluster center, forming K sub-clusters;
S4. Denote the centroid of each sub-cluster as W_i, compute the pre-estimated density value F_{i,t} of W_i according to the formula [formula not reproduced in the source text], and compute the actual density value F_{i,c} of W_i;
where m is the maximum number of iterations and is a preset value, t is the current iteration number, and the value of 2φ(3σ × Ra) can be looked up in a table of the standard normal distribution function;
S5. Take the centroid W_i of each sub-cluster as a new cluster center, and determine whether the sub-clusters of the new cluster centers contain a sample point whose actual density value F_{i,c} is less than the pre-estimated density value F_{i,t}; if not, go to S10; if so, go to S6;
S6. Obtain the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t};
S7. In the sub-cluster obtained in S6, randomly select several sample points and compute the actual density value F_{i,c} of each of them;
S8. Determine whether any of these sample points has an actual density value F_{i,c} greater than the pre-estimated density value F_{i,t}; if so, go to S10; otherwise go to S9;
S9. Take the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t} as the new cluster center, then execute step S10;
S10. Determine whether the cluster centers W_i no longer change; if so, go to S11; otherwise update the iteration number t to t + 1, take the new cluster centers as the new initial cluster centers, and go to S3;
S11. Output the clustering result.
The loop termination condition in S10 is whether the cluster centers W_i no longer change. Specifically, one complete pass starting from the initial cluster centers yields a second set of cluster centers, which is compared with the initial cluster centers; if they differ, a second pass is executed to obtain a third set of cluster centers. If the second set is identical to the initial cluster centers, the loop exits and the clustering result is output. The third set is then compared with the second set; if they differ, a third pass is executed to obtain a fourth set. If the third set is identical to the second set, the loop exits and the clustering result is output, and so on; the loop is executed until the final cluster centers are obtained.
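The termination test of S10 compares each new set of centers with the previous one. A minimal sketch follows; the `update` callback stands in for one full pass of steps S3 to S9, and both it and the tolerance parameter `tol` are assumptions introduced for illustration (the text only says the centers "no longer change"). The cap `max_iter` plays the role of the preset maximum number of iterations m.

```python
import math

def centers_unchanged(old_centers, new_centers, tol=0.0):
    """True when every center moved by at most `tol` (hypothetical tolerance)."""
    return all(math.dist(a, b) <= tol for a, b in zip(old_centers, new_centers))

def iterate_until_stable(centers, update, max_iter=100):
    """Repeat `update` until the centers stop changing or max_iter passes are done.

    `update(centers, t)` is a placeholder for one pass of steps S3 to S9 and
    must return the new list of centers for iteration number t.
    """
    for t in range(1, max_iter + 1):
        new_centers = update(centers, t)
        if centers_unchanged(centers, new_centers):
            return new_centers, t
        centers = new_centers
    return centers, max_iter

# Toy update (hypothetical): move a 1-D center one unit per pass, stopping at 3.
toy_update = lambda cs, t: [(min(c[0] + 1.0, 3.0),) for c in cs]
final_centers, passes = iterate_until_stable([(0.0,)], toy_update)
```

In the toy run the center changes on the first three passes and is unchanged on the fourth, which is when the loop exits.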
In the global optimization K-means clustering method based on sample density of the present invention, the actual density value F_{i,c} is computed according to the formula [formula not reproduced in the source text], where [expression not reproduced in the source text], d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j denotes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is the maximum distance from the cluster center of any sub-cluster to the sample points in that sub-cluster.
To verify the effectiveness of the KMS-GOSD method, five data sets from the standard UCI collection (Balance, Wine, Zoo, Iris, and Diabetes) were selected as test data; their basic information is shown in Table 1. Each data set was tested 50 times with traditional K-means and 50 times with the improved K-means. The experiments were run on Windows 7 with Matlab 2013a, on a host with an Intel Core 2 Duo P8600 @ 2.4 GHz processor and 4 GB of RAM.
Table 1. Description of the selected data sets
With the initial cluster centers of the KMS-GOSD method kept identical to those of traditional K-means in every run, the experimental results are as shown in Tables 2, 3, and 4 below.
Table 2. Test results of the two methods on the Balance data set
Table 2 shows that, for the Balance data set, the KMS-GOSD method reaches a highest accuracy of 73.15%, a lowest of 68.93%, and an average of 70.49%; the accuracy fluctuates little and stays at roughly 71%.
Table 3. Test results of the two methods on the Wine data set
Table 3 shows that, for the Wine data set, the accuracy of traditional K-means is generally low, whereas the KMS-GOSD method reaches a highest accuracy of 78.14%, a lowest of 69.60%, and an average of 74.76%, nearly 10 percentage points higher than traditional K-means.
Table 4 shows that, for the Zoo data set, the KMS-GOSD method reaches a highest accuracy of 82.20%, a lowest of 77.92%, and an average of 80.54%; its accuracy equals that of traditional K-means in only one run and is higher in all the others.
Table 4. Test results of the two methods on the Zoo data set
Table 5. Test results of the two methods on the Iris data set
Table 5 shows that, for the Iris data set, the KMS-GOSD method reaches a highest accuracy of 88.72%, a lowest of 87.32%, and an average of 88.23%; the accuracy stays close to the average, indicating good stability.
Table 6. Test results of the two methods on the Diabetes data set
Table 6 shows that, for the Diabetes data set, the KMS-GOSD method reaches a highest accuracy of 68.10%, a lowest of 65.17%, and an average of 66.83%, a stable improvement of about 3% over traditional K-means.
The experimental results above show that, with identical random initial centers, the KMS-GOSD method has higher accuracy and stability than the traditional K-means clustering method and can, to a certain extent, suppress the dependence of the clustering result on the initial cluster centers.
Tables 2 to 6 show that traditional K-means suffers from dependence on the initial cluster centers, which makes it prone to falling into local optima. The KMS-GOSD method proposed here obtains the pre-estimated density values of the cluster centers from a Gaussian model during iteration, and shifts the cluster center whose actual density value falls furthest below the pre-estimated density value to a point of higher density. These operations not only reduce the amount of computation in data analysis, but also overcome the dependence of the clustering result on the initial cluster centers and strengthen the global exploration ability of the cluster centers. The experimental results show that, on the typical UCI test data sets, the KMS-GOSD method improves accuracy by up to 20.68%, depending on the data set.
The embodiments of the present invention have been described above with reference to the drawings, but the invention is not limited to the specific embodiments described. These embodiments are illustrative rather than restrictive. Inspired by the present invention, those of ordinary skill in the art can devise many other forms without departing from the purpose of the invention and the scope protected by the claims, all of which fall within the protection of the present invention.

Claims (4)

1. A global optimization K-means clustering method based on sample density, characterized by comprising the following steps:
S1. Obtain an original data set X containing N sample points, the number of sub-clusters K, and a scale parameter Ra, where N is greater than 1;
S2. Randomly select K sample points from the original data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, …, K;
S3. Compute the distance from every sample point in the original data set other than the initial cluster centers to each initial cluster center w_i, and assign each such sample point to its nearest initial cluster center, forming K sub-clusters;
S4. Denote the centroid of each sub-cluster as W_i, compute the pre-estimated density value F_{i,t} of W_i according to the formula [formula not reproduced in the source text], and compute the actual density value F_{i,c} of W_i;
where m is the maximum number of iterations and is a preset value, t is the current iteration number, and the value of 2φ(3σ × Ra) is obtained by looking it up in a table of the standard normal distribution function;
S5. Take the centroid W_i of each sub-cluster as a new cluster center, and determine whether the sub-clusters of the new cluster centers contain a sample point whose actual density value F_{i,c} is less than the pre-estimated density value F_{i,t}; if not, go to S10; if so, go to S6;
S6. Obtain the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t};
S7. In the sub-cluster obtained in S6, randomly select several sample points and compute the actual density value F_{i,c} of each of them;
S8. Determine whether any of these sample points has an actual density value F_{i,c} greater than the pre-estimated density value F_{i,t}; if so, go to S10; otherwise go to S9;
S9. Take the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t} as the new cluster center, then execute step S10;
S10. Determine whether the cluster centers W_i no longer change; if so, go to S11; otherwise update the iteration number t to t + 1, take the new cluster centers as the new initial cluster centers, and go to S3;
S11. Output the clustering result.
2. The global optimization K-means clustering method based on sample density according to claim 1, characterized in that the actual density value F_{i,c} is computed according to the formula [formula not reproduced in the source text], where [expression not reproduced in the source text], d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j denotes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is the maximum distance from the cluster center of any sub-cluster to the sample points in that sub-cluster.
3. A global-optimization K-means clustering system based on sample density, characterized in that it comprises the following modules:
An initialization module, for obtaining an original data set X containing N sample points, a sub-cluster number K and a scale parameter Ra, wherein N is greater than 1;
An initial cluster center acquisition module, for randomly selecting K sample points from the original data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, ..., K;
A sub-cluster forming module, for calculating the distance from every sample point in the original data set other than the initial cluster centers to each initial cluster center w_i, and assigning each of those sample points to its nearest initial cluster center w_i, thereby forming K sub-clusters;
A pre-estimated density value and actual density value calculation module, for denoting the centroid of each sub-cluster as W_i, calculating the pre-estimated density value F_{i,t} of W_i by the formula [formula omitted in source], and calculating the actual density value F_{i,c} of W_i;
Wherein m is the preset maximum number of iterations, t denotes the current iteration number, and the value of 2φ(3σ × Ra) can be obtained by looking it up in a table of the standard normal distribution function;
A first judgment module, for taking the centroid W_i of each sub-cluster as a new cluster center and judging, for the sub-cluster containing each new cluster center, whether there is a sample point whose actual density value F_{i,c} is less than the pre-estimated density value F_{i,t}; if not, jumping to the third judgment module; if so, jumping to the sub-cluster acquisition module;
A sub-cluster acquisition module, for obtaining the sub-cluster containing the sample point for which the absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t} is largest;
An actual density value calculation module, for randomly selecting several sample points in the sub-cluster obtained by the sub-cluster acquisition module and calculating the actual density value F_{i,c} of each of them;
A second judgment module, for judging whether, among the several sample points, there is a sample point whose actual density value F_{i,c} is greater than the pre-estimated density value F_{i,t}; if so, jumping to the third judgment module; otherwise jumping to the new cluster center acquisition module;
A new cluster center acquisition module, for taking the sample point with the largest absolute difference between the actual density value F_{i,c} and the pre-estimated density value F_{i,t} as the new cluster center, and then executing the third judgment module;
A third judgment module, for judging whether the cluster centers W_i no longer change; if so, jumping to the output module; otherwise updating the iteration number t to t+1, taking the new cluster centers as the new initial cluster centers, and jumping to the sub-cluster forming module;
An output module, for outputting the clustering result.
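The sub-cluster forming, centroid and convergence steps of claim 3 correspond to the ordinary K-means loop, which can be sketched as follows. This is only an illustrative sketch, not the claimed system: the density-driven center replacement (first and second judgment modules) is left out because it depends on the density formulas, and the function names, the `seed` parameter and the rule of keeping the old center for an empty sub-cluster are all assumptions introduced here.

```python
import math
import random


def assign_to_nearest(points, centers):
    """Sub-cluster forming: put each point in the group of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of equal-length tuples (W_i)."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans_core(points, k, max_iter, seed=0):
    """Plain K-means loop; `seed` and the empty-cluster rule are assumptions."""
    points = [tuple(p) for p in points]
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # K random sample points as w_i
    clusters = assign_to_nearest(points, centers)
    for _ in range(max_iter):                 # m: preset maximum number of iterations
        # Centroid of each sub-cluster; an empty sub-cluster keeps its old center.
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:            # third judgment: centers no longer change
            break
        centers = new_centers                 # new centers become the initial centers
        clusters = assign_to_nearest(points, centers)
    return centers, clusters
```

A center that is itself a sample point lands in its own sub-cluster at distance zero, so assigning all points at once gives the same grouping as assigning only the non-center points.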
4. The global-optimization K-means clustering system based on sample density according to claim 1, characterized in that the actual density value F_{i,c} in the pre-estimated density value and actual density value calculation module is calculated according to the formula [formula omitted in source], wherein [definition omitted in source]; d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster; S_i denotes the number of sample points in the i-th sub-cluster; j indexes the j-th sample point; c ∈ [1, c_max], where c_max is the preset maximum number of offsets; r = R × Ra; and R is the maximum distance from the cluster center of any sub-cluster to the sample points in that sub-cluster.
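The quantities that claim 4 defines in words (d_ij, S_i, R and r = R × Ra) can be computed as in the sketch below. It does not evaluate F_{i,c}, since the density formula itself is not given in this text. The function name is hypothetical, and treating R as a per-sub-cluster maximum is an assumed reading of the claim wording.

```python
import math


def subcluster_geometry(center, cluster, ra):
    """Return (d_ij list, S_i, R, r) for one sub-cluster.

    center  : the cluster center W_i, as a tuple of coordinates
    cluster : the sample points n_ij of the i-th sub-cluster
    ra      : the scale parameter Ra
    Assumption: R is taken per sub-cluster, as the largest d_ij.
    """
    distances = [math.dist(center, p) for p in cluster]  # d_ij, Euclidean
    s_i = len(cluster)                                   # S_i, sample count
    big_r = max(distances)                               # R, farthest sample
    r = big_r * ra                                       # r = R × Ra
    return distances, s_i, big_r, r
```

Because r scales with the farthest sample in the sub-cluster, Ra acts as a relative radius setting instead of an absolute distance.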

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810525709.1A CN108985318A (en) 2018-05-28 2018-05-28 A kind of global optimization K mean cluster method and system based on sample rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810525709.1A CN108985318A (en) 2018-05-28 2018-05-28 A kind of global optimization K mean cluster method and system based on sample rate

Publications (1)

Publication Number Publication Date
CN108985318A true CN108985318A (en) 2018-12-11

Family

ID=64542224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810525709.1A Pending CN108985318A (en) 2018-05-28 2018-05-28 A kind of global optimization K mean cluster method and system based on sample rate

Country Status (1)

Country Link
CN (1) CN108985318A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046801A (en) * 2019-03-25 2019-07-23 国网江苏省电力有限公司经济技术研究院 A kind of typical scene generation method of power distribution network electric system
CN110046801B (en) * 2019-03-25 2023-07-21 国网江苏省电力有限公司经济技术研究院 Typical scene generation method of power distribution network power system
WO2021044251A1 (en) * 2019-09-06 2021-03-11 International Business Machines Corporation Elastic-centroid based clustering
US11727250B2 (en) 2019-09-06 2023-08-15 International Business Machines Corporation Elastic-centroid based clustering
CN112101210A (en) * 2020-09-15 2020-12-18 贵州电网有限责任公司 Low-voltage distribution network fault diagnosis method based on multi-source information fusion
CN113850281A (en) * 2021-02-05 2021-12-28 天翼智慧家庭科技有限公司 Data processing method and device based on MEANSHIFT optimization
WO2022166380A1 (en) * 2021-02-05 2022-08-11 天翼数字生活科技有限公司 Data processing method and apparatus based on meanshift optimization
CN113850281B (en) * 2021-02-05 2024-03-12 天翼数字生活科技有限公司 MEANSHIFT optimization-based data processing method and device
CN113378954A (en) * 2021-06-23 2021-09-10 云南电网有限责任公司电力科学研究院 Load curve clustering method and system based on particle swarm improved K-means algorithm

Similar Documents

Publication Publication Date Title
CN108985318A (en) A kind of global optimization K mean cluster method and system based on sample rate
CN106682682A (en) Method for optimizing support vector machine based on Particle Swarm Optimization
CN105868775A (en) Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN111986811A (en) Disease prediction system based on big data
CN110083665A (en) Data classification method based on the detection of improved local outlier factor
CN110059852A (en) A kind of stock yield prediction technique based on improvement random forests algorithm
CN109086412A (en) A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN104731916A (en) Optimizing initial center K-means clustering method based on density in data mining
De Amorim Constrained clustering with minkowski weighted k-means
CN109145965A (en) Cell recognition method and device based on random forest disaggregated model
CN104573708A (en) Ensemble-of-under-sampled extreme learning machine
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN108564592A (en) Based on a variety of image partition methods for being clustered to differential evolution algorithm of dynamic
CN109444840B (en) Radar clutter suppression method based on machine learning
CN109271427A (en) A kind of clustering method based on neighbour's density and manifold distance
CN107045717A (en) The detection method of leucocyte based on artificial bee colony algorithm
CN113435108B (en) Battlefield target grouping method based on improved whale optimization algorithm
CN109150830A (en) A kind of multilevel intrusion detection method based on support vector machines and probabilistic neural network
CN115310554A (en) Item allocation strategy, system, storage medium and device based on deep clustering
CN116821715A (en) Artificial bee colony optimization clustering method based on semi-supervision constraint
CN114841241A (en) Unbalanced data classification method based on clustering and distance weighting
CN110032973A (en) A kind of unsupervised helminth classification method and system based on artificial intelligence
CN105913085A (en) Tensor model-based multi-source data classification optimizing method and system
CN109934344B (en) Improved multi-target distribution estimation method based on rule model
CN117035983A (en) Method and device for determining credit risk level, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181211
