CN108985318A

CN108985318A - Global optimization K-means clustering method and system based on sample density

Info

Publication number: CN108985318A
Application number: CN201810525709.1A
Authority: CN
Inventors: 许鸿文; 薛印玺; 陈雯; 李羚; 殷蔚明; 谢靖
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-12-11

Abstract

To address the problem that the clustering result of the traditional K-means clustering method depends on the initial cluster centers and easily falls into a local optimum, a global optimization K-means clustering method and system based on sample density (KMS-GOSD) is proposed. During iteration, the KMS-GOSD method first obtains the pre-estimated density values of all cluster centers through a Gaussian model, and then applies an offset operation to the cluster center whose actual density value falls furthest below its pre-estimated density value. By optimizing the positions of the cluster centers, the KMS-GOSD method not only improves global exploration ability but also overcomes the dependence on the initial cluster centers. Comparative experiments on standard UCI data sets show that the improved method achieves higher accuracy and stability than the traditional method.

Description

Global optimization K-means clustering method and system based on sample density

Technical field

The present invention relates to the field of fast density peak clustering in machine learning, and more particularly to a global optimization K-means clustering method and system based on sample density.

Background technique

The traditional K-means clustering method is simple and effective, converges quickly, and handles large data sets conveniently, and it is now widely used in many fields such as scientific research and industrial applications. However, its weak global exploration ability on complex data sets and its tendency to fall into local optima remain key difficulties in research on improving K-means.

Drawing on the core idea of density peak clustering, scholars in China and abroad have analyzed and improved the K-means method from different angles. In China, Xing Changzheng et al. proposed a clustering method that optimizes the initial centers based on average density: the attributes and structure of the data objects are analyzed before clustering in order to choose suitable initial cluster centers, which replace the random initial center points of traditional K-means, while the iterative process of traditional K-means is left unchanged. Scholars abroad have proposed using kernel functions, adaptive neural network methods, differential evolution methods and similar techniques during iteration to help K-means find high-density sample points over the global range. For example, the document "Approximate Normalized Cuts without Eigen-decomposition" (Information Sciences) proposes optimizing the objective function with an approximate weighted kernel.

Because traditional K-means is sensitive to the cluster centers, the choice of cluster centers directly affects clustering accuracy. The documents "Optimization of the initial cluster centers of the K-means algorithm" and "A distributed clustering mining algorithm based on local density" point out that a cluster center should lie at a point of relatively high sample density within the cluster. The documents "An improved k-means initial cluster center selection algorithm", "A new k-means cluster center selection algorithm" and "K-means algorithm with minimum-variance optimization of the initial cluster centers" show, through theoretical analysis and experiments, that clustering accuracy improves markedly when the cluster centers are located where the sample density is higher.

Summary of the invention

To address the problem that the clustering result of the traditional K-means clustering method depends on the initial cluster centers and easily falls into a local optimum, and to avoid excessive analysis of scattered objects while speeding up global exploration, the present invention proposes a global optimization K-means clustering method based on sample density (Global Optimized K-means Clustering Algorithm based on Sample Density, KMS-GOSD for short). During the iteration of the traditional K-means clustering method, the KMS-GOSD method shifts the cluster center whose actual density value falls furthest below its pre-estimated density value to a point in the same cluster whose density is greater than the pre-estimated density value. This avoids falling into a local optimum and thereby overcomes the dependence of the clustering result on the initial cluster centers. In addition, before the offset, a decay factor inversely proportional to the number of iterations is applied to gradually lower the pre-estimated density value, which in turn lowers the probability that a cluster center is shifted. This gives the KMS-GOSD method strong global exploration ability in the early stage and strong stability in the later stage.

A global optimization K-means clustering method based on sample density comprises the following steps:

S1, obtain a raw data set X containing N sample points, the number of sub-clusters K and a scale parameter Ra, where N is greater than 1;

S2, randomly select K sample points from the raw data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, ..., K;

S3, calculate the distance from every sample point in the raw data set other than the initial cluster centers to each initial cluster center w_i, and assign every such sample point to its nearest initial cluster center, forming K sub-clusters;

S4, denote the centroids of all sub-clusters as W_i, calculate the pre-estimated density value F_i,t of W_i according to the formula (not reproduced in this text), and calculate the actual density value F_i,c of W_i;

where m is the preset maximum number of iterations, t is the current iteration number, and the value of 2φ(3σ × Ra) can be looked up in a table of the standard normal distribution function;

S5, take the centroid W_i of each sub-cluster as a new cluster center, and determine for the sub-cluster containing each new cluster center whether there is a sample point whose actual density value F_i,c is less than the pre-estimated density value F_i,t; if not, jump to S10; if so, jump to S6;

S6, obtain the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t;

S7, in the sub-cluster obtained in S6, randomly select several sample points and calculate the actual density value F_i,c of each of them;

S8, determine whether any of the several sample points has an actual density value F_i,c greater than the pre-estimated density value F_i,t; if so, jump to S10; otherwise jump to S9;

S9, take the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t as the new cluster center, then execute step S10;

S10, determine whether the cluster centers W_i no longer change; if so, jump to S11; otherwise update the iteration number t to t + 1, take the new cluster centers as the new initial cluster centers, and jump to S3;

S11, output the clustering result.
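The flow of steps S1 to S11 can be sketched roughly as follows. This is only an illustrative reading, not the patented implementation: the formula images are missing from this text, so the sketch assumes a linear decay `f_t = f0 * (1 - t/m)` for the pre-estimated density, assumes the actual density is the share of a sub-cluster's points within `r = R * Ra` of a point, sets σ = 1, and reads S8 in the sense of the summary above (the center moves to a sampled point whose density exceeds the pre-estimated value). The function name `kms_gosd_sketch`, the helper names and the parameter `n_candidates` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _assign(data: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """S3: index of the nearest center for every sample point."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in data]


def _centroid(points: Sequence[Point]) -> Point:
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def _density(center: Point, members: Sequence[Point], ra: float) -> float:
    """ASSUMED F_i,c: share of members within r = R * Ra of `center`."""
    dists = [math.dist(center, p) for p in members]
    r = max(dists) * ra
    return sum(1 for d in dists if d <= r) / len(members)


def kms_gosd_sketch(data, k, ra, m=50, n_candidates=5, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(list(data), k)                      # S2
    f0 = 2.0 * _phi(3.0 * ra) - 1.0                          # 2*phi(3*sigma*Ra) - 1, sigma = 1
    for t in range(m):
        labels = _assign(data, centers)                      # S3
        clusters = [[p for p, lab in zip(data, labels) if lab == i]
                    for i in range(k)]
        new_centers = [_centroid(c) if c else centers[i]     # S4: centroids W_i
                       for i, c in enumerate(clusters)]
        f_t = f0 * (1.0 - t / m)                             # ASSUMED decay form
        f_c = [_density(w, c, ra) if c else f_t
               for w, c in zip(new_centers, clusters)]
        worst = max(range(k), key=lambda i: f_t - f_c[i])    # S6: largest shortfall
        if f_c[worst] < f_t:                                 # S5
            members = clusters[worst]
            picks = rng.sample(members, min(n_candidates, len(members)))  # S7
            scored = [(_density(p, members, ra), p) for p in picks]
            above = [s for s in scored if s[0] > f_t]        # S8 (interpreted)
            if above:
                new_centers[worst] = max(above)[1]
            else:                                            # S9
                new_centers[worst] = max(scored, key=lambda s: abs(s[0] - f_t))[1]
        if new_centers == centers:                           # S10: centers unchanged
            break
        centers = new_centers
    return centers, _assign(data, centers)                   # S11
```

Because the offset only ever moves a center to a member of its own sub-cluster and the assumed threshold decays toward zero, the late iterations of this sketch behave like plain K-means, which matches the early-exploration and late-stability behavior described in the text.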

In the global optimization K-means clustering method based on sample density of the present invention, the actual density value F_i,c is calculated according to a formula (not reproduced in this text), where d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j indexes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is, for any sub-cluster, the longest distance from the cluster center to a sample point in that sub-cluster.
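Since the density formula itself did not survive in this text, the following is only a sketch under a stated assumption: it treats the actual density value as the fraction of the S_i points of a sub-cluster whose distance d_ij to the center is at most r = R × Ra. The function name `actual_density` is hypothetical, and the counting rule is an assumption consistent with the symbols defined above, not the confirmed formula.

```python
import math
from typing import Sequence

Point = Sequence[float]


def actual_density(center: Point, members: Sequence[Point], ra: float) -> float:
    """ASSUMED F_i,c: share of the sub-cluster's points within r = R * Ra."""
    if not members:
        return 0.0
    distances = [math.dist(center, p) for p in members]  # d_ij
    big_r = max(distances)                               # R: farthest member
    r = big_r * ra                                       # statistics radius
    inside = sum(1 for d in distances if d <= r)         # points inside the circle
    return inside / len(members)                         # normalize by S_i


members = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
print(actual_density((0.0, 0.0), members, ra=0.5))  # R = 4, r = 2, 3 of 4 points inside
```

Normalizing by S_i keeps the value in [0, 1], which makes it directly comparable with a probability-style pre-estimated density taken from the normal distribution.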

The present invention also provides a global optimization K-means clustering system based on sample density, comprising the following modules:

an initialization module, for obtaining a raw data set X containing N sample points, the number of sub-clusters K and a scale parameter Ra, where N is greater than 1;

an initial cluster center obtaining module, for randomly selecting K sample points from the raw data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, ..., K;

a sub-cluster forming module, for calculating the distance from every sample point in the raw data set other than the initial cluster centers to each initial cluster center w_i, and assigning every such sample point to its nearest initial cluster center w_i, forming K sub-clusters;

a pre-estimated density value and actual density value computing module, for denoting the centroids of all sub-clusters as W_i, calculating the pre-estimated density value F_i,t of W_i by the formula (not reproduced in this text), and calculating the actual density value F_i,c of W_i;

a first judgment module, for taking the centroid W_i of each sub-cluster as a new cluster center, and determining for the sub-cluster containing each new cluster center whether there is a sample point whose actual density value F_i,c is less than the pre-estimated density value F_i,t; if not, jumping to the third judgment module; if so, jumping to the sub-cluster obtaining module;

a sub-cluster obtaining module, for obtaining the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t;

an actual density value computing module, for randomly selecting several sample points in the sub-cluster obtained by the sub-cluster obtaining module, and calculating the actual density value F_i,c of each of them;

a second judgment module, for determining whether any of the several sample points has an actual density value F_i,c greater than the pre-estimated density value F_i,t; if so, jumping to the third judgment module; otherwise jumping to the new cluster center obtaining module;

a new cluster center obtaining module, for taking the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t as the new cluster center, then executing the third judgment module;

a third judgment module, for determining whether the cluster centers W_i no longer change; if so, jumping to the result output module; otherwise updating the iteration number t to t + 1, taking the new cluster centers as the new initial cluster centers, and jumping to the sub-cluster forming module;

a result output module, for outputting the clustering result.

In the global optimization K-means clustering system based on sample density of the present invention, the actual density value F_i,c in the pre-estimated density value and actual density value computing module is calculated according to a formula (not reproduced in this text), where d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j indexes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is, for any sub-cluster, the longest distance from the cluster center to a sample point in that sub-cluster.

Compared with the traditional K-means clustering method, the KMS-GOSD method of the present invention first obtains the pre-estimated density values of all cluster centers through a Gaussian model during iteration, and then applies an offset operation to the cluster center whose actual density value falls furthest below its pre-estimated density value. By optimizing the positions of the cluster centers, the KMS-GOSD method not only improves global exploration ability but also overcomes the dependence on the initial cluster centers. Comparative experiments on standard UCI data sets show that the improved method achieves higher accuracy and stability than the traditional method.

Detailed description of the invention

The present invention is further described below with reference to the accompanying drawings and embodiments, in which:

Fig. 1 is a flowchart of an embodiment of the present invention;

Fig. 2 shows the Gaussian distribution model corresponding to the sample point density of a sub-class.

Specific embodiment

For a clearer understanding of the technical features, objects and effects of the invention, specific embodiments of the invention are now described in detail with reference to the accompanying drawings.

Because traditional K-means is sensitive to the cluster centers, the choice of cluster centers directly affects clustering accuracy. The documents "Optimization of the initial cluster centers of the K-means algorithm" and "A distributed clustering mining algorithm based on local density" point out that a cluster center should lie at a point of relatively high sample density within the cluster. The documents "An improved k-means initial cluster center selection algorithm", "A new k-means cluster center selection algorithm" and "K-means algorithm with minimum-variance optimization of the initial cluster centers" show, through theoretical analysis and experiments, that clustering accuracy improves markedly when the cluster centers are located where the sample density is higher. The core idea of the global optimization K-means clustering algorithm proposed by the present invention is to find higher-density cluster centers dynamically during iteration, instead of choosing higher-density initial cluster centers directly before iteration. To judge how dense a cluster center is, the present invention first obtains the initial pre-estimated density value for the iteration through a Gaussian model, then calculates the actual density value of the cluster center using Euclidean distances, and finally compares the actual density value with the pre-estimated value, which is gradually reduced during iteration, to obtain a higher-density cluster center.

The document "A distributed clustering mining algorithm based on local density" proposes that a sample set is a mixture of ideal clusters, and that within each ideal cluster the local density of the sample points changes monotonically from the interior of the cluster to its edge. On this basis, the density distribution of the sample points of a sub-class is estimated with a Gaussian distribution model. Assume that in a data set X of N sample points the number of sub-classes is K, the initial cluster centers are w_i (i = 1, 2, 3, ..., K), the cluster centers during iteration are W_i, and the sub-cluster radius R is the distance from W_i to the farthest sample point in its cluster. To make it convenient to estimate the density of a sub-cluster, a circle centered at w_i with radius r = R × Ra is chosen as the sample region for the statistics.

As shown in Fig. 2, under the effect of the factor Ra, x₁ = 3σ × Ra. The density of W_i within the circle of radius r is estimated by computing the distribution function over the range [−x₁, x₁], which gives the pre-estimated density value:

$$F_i = 2\varphi(3\sigma \times Ra) - 1 \qquad (1)$$
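The table lookup for the normal distribution can also be done numerically. The sketch below evaluates the probability mass of a normal distribution inside [−x₁, x₁] with x₁ = 3σ × Ra, using the standard identity P(|Z| ≤ z) = 2Φ(z) − 1. It assumes the argument is standardized by σ before the standard normal function is applied, so σ cancels; the function names are hypothetical.

```python
import math


def standard_normal_cdf(x: float) -> float:
    """Phi(x) of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def initial_pre_estimated_density(ra: float, sigma: float = 1.0) -> float:
    """Probability mass of N(0, sigma^2) inside [-x1, x1], with x1 = 3 * sigma * ra."""
    x1 = 3.0 * sigma * ra
    # Standardize before applying Phi; this is an assumption about the notation.
    return 2.0 * standard_normal_cdf(x1 / sigma) - 1.0


for ra in (0.25, 0.5, 1.0):
    print(ra, round(initial_pre_estimated_density(ra), 4))
```

With Ra = 1 the region is the familiar three-sigma interval, so the value is close to 0.9973; smaller values of Ra shrink the region and lower the pre-estimated density accordingly.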

The actual density value of the cluster center is defined by a formula (not reproduced in this text) in which d_ij is the Euclidean distance from W_i to the sample point n_ij, S_i is the number of sample points in the i-th sub-cluster, c ∈ [1, c_max], and c_max is the maximum number of offsets.

In each iteration, traditional K-means takes the centroid of each new sub-class as the cluster center W_i of the new class. Using the centroid as the cluster center easily lands on the point with the smallest sum of local distances, so the method easily falls into a local optimum, and the centroid is not the global optimum. To avoid this, the KMS-GOSD algorithm adds a density-based offset operation to the iterative process.

In the pre-estimated density formula, m denotes the maximum number of iterations and t the current iteration number. Adding the decay factor gradually lowers the pre-estimated density value and speeds up the convergence of the offset process. To reduce the complexity of the KMS-GOSD algorithm, the offset is applied only to the cluster center with the lowest density among the W_i; the other cluster centers remain unchanged. As a result, the KMS-GOSD algorithm has strong exploration ability in the early stage, while in the later stage it gradually coincides with traditional K-means and shows strong exploitation ability and convergence performance.
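The effect of the decay can be illustrated with a small sketch. The exact decay formula is not legible in this text, so the linear form `f_initial * (1 - t / m)` used here is purely an assumption chosen because it depends only on t and m and falls to zero at the last iteration; the function name is hypothetical.

```python
def decayed_pre_estimated_density(f_initial: float, t: int, m: int) -> float:
    """ASSUMED decay: the threshold shrinks linearly from f_initial (t = 0) to 0 (t = m)."""
    if m <= 0:
        raise ValueError("m must be a positive maximum number of iterations")
    t_clamped = min(max(t, 0), m)  # keep t inside [0, m]
    return f_initial * (1.0 - t_clamped / m)


# The threshold a cluster center must reach drops as iterations proceed,
# so offsets become rarer and the loop approaches plain K-means.
thresholds = [decayed_pre_estimated_density(0.8, t, 10) for t in range(0, 11, 2)]
print(thresholds)
```

Under this assumed form, a center with a fixed actual density that triggers an offset early on will stop triggering it once the threshold has dropped below that density, which is the mechanism behind the reduced offset probability in later iterations.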

The detailed flow of the global optimization K-means clustering method based on sample density is shown in Fig. 1. It follows steps S1 to S10 as set out in the Summary of the invention and ends with:

S11, output the clustering result.

The loop termination condition in S10 is whether the cluster centers W_i no longer change. Specifically, executing one complete pass starting from the initial cluster centers yields the second cluster centers, which are compared with the initial cluster centers. If they differ, a second pass is executed to obtain the third cluster centers; if the second cluster centers are identical to the initial cluster centers, the loop exits and the clustering result is output. The third cluster centers are then compared with the second cluster centers: if they differ, a third pass is executed to obtain the fourth cluster centers; if the third cluster centers are identical to the second cluster centers, the loop exits and the clustering result is output. The loop continues in this way until the final cluster centers are obtained.
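The comparison of successive sets of cluster centers can be sketched as below. The text only states that the loop stops when the centers no longer change; the optional tolerance `tol` is an added assumption for floating-point data, and the function name is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def centers_unchanged(previous: Sequence[Point], current: Sequence[Point],
                      tol: float = 0.0) -> bool:
    """True when every center moved by at most `tol` between two passes."""
    if len(previous) != len(current):
        return False
    return all(math.dist(p, c) <= tol for p, c in zip(previous, current))


first = [(0.0, 0.0), (5.0, 5.0)]
second = [(0.5, 0.5), (5.0, 5.0)]
print(centers_unchanged(first, second))   # one center moved, so another pass is needed
print(centers_unchanged(second, second))  # identical centers, so the loop can stop
```

With `tol=0.0` this is the exact equality test described in the text; a small positive tolerance is a common practical relaxation when centroids are computed in floating point.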

To verify the effectiveness of the KMS-GOSD method, five data sets from the standard UCI repository, namely Balance, Wine, Zoo, Iris and Diabetes, are selected as test data; their basic information is given in Table 1. Each data set is tested 50 times with both traditional K-means and the improved K-means. The experiments were run on Windows 7 with the Matlab 2013a programming environment, on a host with an Intel Core 2 Duo P8600 @ 2.4 GHz processor and 4 GB of memory.

Table 1. Description of the selected data sets

With the initial cluster centers of the KMS-GOSD method kept identical to those of traditional K-means in every run, the experimental results are given in Tables 2 to 6 below.

Table 2. Test results of the two methods on the Balance data set

Table 2 shows that, on the Balance data set, the KMS-GOSD method reaches a highest accuracy of 73.15%, a lowest of 68.93% and an average of 70.49%; the accuracy stays comparatively high and fluctuates little, remaining at roughly 71%.

Table 3. Test results of the two methods on the Wine data set

As Table 3 shows, on the Wine data set the accuracy of traditional K-means is generally low, whereas the KMS-GOSD method reaches a highest accuracy of 78.14%, a lowest of 69.60% and an average of 74.76%, nearly 10 percentage points higher than traditional K-means.

Table 4 shows that, on the Zoo data set, the KMS-GOSD method reaches a highest accuracy of 82.20%, a lowest of 77.92% and an average of 80.54%; its accuracy equals that of traditional K-means in only one run and is higher in all the others.

Table 4. Test results of the two methods on the Zoo data set

Table 5. Test results of the two methods on the Iris data set

As Table 5 shows, on the Iris data set the KMS-GOSD method reaches a highest accuracy of 88.72%, a lowest of 87.32% and an average of 88.23%; the accuracy stays close to the average, indicating good stability.

Table 6. Test results of the two methods on the Diabetes data set

As Table 6 shows, on the Diabetes data set the KMS-GOSD method reaches a highest accuracy of 68.10%, a lowest of 65.17% and an average of 66.83%, a stable improvement of about 3% over traditional K-means.

From the above experimental results it can be seen that, with identical random initial center points, the KMS-GOSD method achieves higher accuracy and stability than the traditional K-means clustering method and can, to a certain extent, suppress the dependence of the clustering result on the initial cluster centers.

Tables 2 to 6 show that traditional K-means depends on the initial cluster centers and therefore easily falls into a local optimum. The KMS-GOSD method proposed here obtains the pre-estimated density value of the cluster centers through a Gaussian model during iteration, and shifts the cluster center whose actual density value falls furthest below its pre-estimated density value to a point of higher density. This operation not only reduces the amount of computation in data analysis, but also overcomes the dependence of the clustering result on the initial cluster centers and enhances the global exploration ability of the cluster centers. The experimental results show that, on typical UCI test data sets, the KMS-GOSD method improves accuracy by up to 20.68%, depending on the data set.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to the specific embodiments described, which are merely illustrative rather than restrictive. Under the inspiration of the present invention, those of ordinary skill in the art can devise many further forms without departing from the purpose of the invention and the scope protected by the claims, all of which fall within the protection of the present invention.

Claims

1. A global optimization K-means clustering method based on sample density, characterized in that it comprises the following steps:

S2, randomly select K sample points from the raw data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, ..., K;

S3, calculate the distance from every sample point in the raw data set other than the initial cluster centers to each initial cluster center w_i, and assign every such sample point to its nearest initial cluster center, forming K sub-clusters;

S4, denote the centroids of all sub-clusters as W_i, calculate the pre-estimated density value F_i,t of W_i according to the formula (not reproduced in this text), and calculate the actual density value F_i,c of W_i;

where m is the preset maximum number of iterations, t is the current iteration number, and the value of 2φ(3σ × Ra) is looked up in a table of the standard normal distribution function;

S5, take the centroid W_i of each sub-cluster as a new cluster center, and determine for the sub-cluster containing each new cluster center whether there is a sample point whose actual density value F_i,c is less than the pre-estimated density value F_i,t; if not, jump to S10; if so, jump to S6;

S6, obtain the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t;

S7, in the sub-cluster obtained in S6, randomly select several sample points and calculate the actual density value F_i,c of each of them;

S8, determine whether any of the several sample points has an actual density value F_i,c greater than the pre-estimated density value F_i,t; if so, jump to S10; otherwise jump to S9;

S9, take the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t as the new cluster center, then execute step S10;

S10, determine whether the cluster centers W_i no longer change; if so, jump to S11; otherwise update the iteration number t to t + 1, take the new cluster centers as the new initial cluster centers, and jump to S3;

S11, output the clustering result.

2. The global optimization K-means clustering method based on sample density according to claim 1, characterized in that the actual density value F_i,c is calculated according to a formula (not reproduced in this text), where d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j indexes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is, for any sub-cluster, the longest distance from the cluster center to a sample point in that sub-cluster.

3. A global optimization K-means clustering system based on sample density, characterized in that it comprises the following modules:

an initialization module, for obtaining a raw data set X containing N sample points, the number of sub-clusters K and a scale parameter Ra, where N is greater than 1;

an initial cluster center obtaining module, for randomly selecting K sample points from the raw data set X as initial cluster centers, denoted w_i, where i = 1, 2, 3, ..., K;

a sub-cluster forming module, for calculating the distance from every sample point in the raw data set other than the initial cluster centers to each initial cluster center w_i, and assigning every such sample point to its nearest initial cluster center w_i, forming K sub-clusters;

where m is the preset maximum number of iterations, t is the current iteration number, and the value of 2φ(3σ × Ra) can be looked up in a table of the standard normal distribution function;

a first judgment module, for taking the centroid W_i of each sub-cluster as a new cluster center, and determining for the sub-cluster containing each new cluster center whether there is a sample point whose actual density value F_i,c is less than the pre-estimated density value F_i,t; if not, jumping to the third judgment module; if so, jumping to the sub-cluster obtaining module;

a sub-cluster obtaining module, for obtaining the sub-cluster containing the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t;

an actual density value computing module, for randomly selecting several sample points in the sub-cluster obtained by the sub-cluster obtaining module, and calculating the actual density value F_i,c of each of them;

a second judgment module, for determining whether any of the several sample points has an actual density value F_i,c greater than the pre-estimated density value F_i,t; if so, jumping to the third judgment module; otherwise jumping to the new cluster center obtaining module;

a new cluster center obtaining module, for taking the sample point with the largest absolute difference between the actual density value F_i,c and the pre-estimated density value F_i,t as the new cluster center, then executing the third judgment module;

a third judgment module, for determining whether the cluster centers W_i no longer change; if so, jumping to the result output module; otherwise updating the iteration number t to t + 1, taking the new cluster centers as the new initial cluster centers, and jumping to the sub-cluster forming module;

a result output module, for outputting the clustering result.

4. The global optimization K-means clustering system based on sample density according to claim 1, characterized in that the actual density value F_i,c in the pre-estimated density value and actual density value computing module is calculated according to a formula (not reproduced in this text), where d_ij is the Euclidean distance from W_i to the j-th sample point n_ij in the i-th sub-cluster, S_i is the number of sample points in the i-th sub-cluster, j indexes the j-th sample point, c ∈ [1, c_max], c_max is the preset maximum number of offsets, and r = R × Ra; R is, for any sub-cluster, the longest distance from the cluster center to a sample point in that sub-cluster.