CN108846434A

CN108846434A - A missing data filling method based on an improved K-means clustering algorithm

Info

Publication number: CN108846434A
Application number: CN201810597825.4A
Authority: CN
Inventors: 蔡延光; 陈东; 蔡颢
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2018-11-20

Abstract

The invention belongs to the field of data processing and, more particularly, relates to a missing data filling method based on an improved K-means clustering algorithm: the data are first classified with the improved K-means clustering algorithm, and the missing data are then filled using the expectation-maximization method. The specific steps are: S1. an artificial fish swarm algorithm is proposed to determine the K value of the K-means clustering algorithm; S2. an improved K-means clustering algorithm is proposed; S3. the objective function f(x) of the K-means clustering algorithm is designed; S4. an improved expectation-maximization method is proposed to fill the missing data in the data set. The proposed method fills missing data with higher speed and precision, and the filling works well.

Description

A missing data filling method based on an improved K-means clustering algorithm

Technical field

The invention belongs to the field of data processing and, more particularly, relates to a missing data filling method based on an improved K-means clustering algorithm.

Background art

Missing data refers to null values, or values that do not meet requirements, that become mixed into a data set because of human operating errors or equipment faults during data acquisition or transmission. Missing data is relatively common in remote health monitoring systems, where it is caused by non-standard formats, data transmission problems, and similar reasons. Common missing data filling algorithms include multiple imputation, multiple regression imputation, and expectation-maximization imputation.

Summary of the invention

To overcome at least one of the drawbacks of the prior art described above, the present invention provides a missing data filling method based on an improved K-means clustering algorithm, which effectively improves the result of filling missing data.

To solve the above technical problem, the technical solution adopted by the present invention is a missing data filling method based on an improved K-means clustering algorithm, comprising the following steps:

S1. Use the artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm;

S2. Propose an improved K-means clustering algorithm, comprising:

S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm; the formula of the objective function f(x) is:

In the formula, x denotes a data object, K denotes the number of cluster centers, c_i denotes the i-th cluster center, and dist denotes the Euclidean distance;

S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm;

S23. Calculate the Euclidean distance from every data object to the K initial cluster centers and, according to these distances, assign each data object to its nearest cluster center;

S24. Recalculate the cluster center of each cluster to obtain K new cluster centers;

S25. Determine whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result; if it has not, i.e. the new cluster centers are not consistent with the K cluster centers obtained in the previous iteration, repeat steps S23 to S25;

S3. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling the missing data according to that type, and then fill the missing data in the data set using the expectation-maximization method.
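The loop in S21 to S25 can be sketched in Python as follows. The formula for f(x) is not reproduced in the text above, so this sketch assumes the usual K-means objective, the sum of squared Euclidean distances from each data object to its assigned center. The names `objective` and `kmeans` and the parameters `tol` and `max_iter` are illustrative and are not taken from the patent.

```python
import numpy as np

def objective(X, centers, labels):
    """Assumed f(x): sum of squared Euclidean distances to the assigned centers."""
    return float(np.sum((X - centers[labels]) ** 2))

def kmeans(X, init_centers, tol=1e-9, max_iter=100):
    """S22 to S25: assign, update, and stop once f(x) no longer changes."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)  # S22: initial centers (copied)
    prev_f = np.inf
    for _ in range(max_iter):
        # S23: Euclidean distance from every object to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # S24: recompute each center; an empty cluster keeps its old center
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
        # S25: convergence test on the (assumed) objective
        f = objective(X, centers, labels)
        if abs(prev_f - f) < tol:
            break
        prev_f = f
    return centers, labels, f
```

For example, `kmeans(np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]]), [[0, 0], [10, 10]])` groups the first two and the last two points. Testing the change in f(x) rather than comparing centers directly is one possible reading of S25; the two tests coincide once the assignments stop changing.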

Further, step S3 specifically comprises:

S31. Initialize the distribution parameters;

S32. Determine the initial value θ^(0) of the expectation-maximization method; the initial value is the mean of the current observed data set X_obs;

S33. Calculate the expectation step for the filled data, i.e. the E step, as follows:

E(X_fill|X_obs,θ^(k))=θ^(k-1)

In the formula, k denotes the iteration number, X_fill denotes the fill value, E(X_fill|X_obs,θ^(k)) denotes the expected value of the filled data, and θ^(k) denotes the estimated parameter at step k;

S34. Calculate the maximum likelihood estimate of the parameter from the expected value, i.e. the maximization step (M step), as follows:

In the formula, p denotes the number of data in the observed data set X_obs and n denotes the total number of data; X_i is the position of the current artificial fish, and j is the number of data in the observed data set X_obs plus 1;

S35. Determine whether the convergence condition is met; if so, proceed to step S36; otherwise, return to step S33. The convergence condition is calculated according to the following formula:

|E(X_fill|X_obs,θ^(k)) - E(X_fill|X_obs,θ^(k-1))| < ε

In the formula, ε denotes the convergence parameter;

S36. Output the predicted value X_fill and use it to fill the missing data in the data set.
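Steps S31 to S36 can be sketched as below for a single variable. The M-step formula is not reproduced in the text, so the sketch assumes that θ is re-estimated as the mean over all n values, with the p observed values taken as they are and the n - p missing values replaced by their current expected value. Missing entries are assumed to be encoded as NaN and at least one value is assumed to be observed; `em_mean_fill`, `eps`, `max_iter`, and `theta0` are illustrative names.

```python
import numpy as np

def em_mean_fill(x, eps=1e-6, max_iter=100, theta0=None):
    """Iteratively estimate a fill value for the NaN entries of a 1-D array."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    obs = x[~missing]
    n, p = x.size, obs.size
    # S32: theta^(0) defaults to the mean of the observed data
    theta = obs.mean() if theta0 is None else float(theta0)
    prev_fill = None
    for _ in range(max_iter):
        # S33 (E step): the expected fill value equals the previous estimate
        fill = theta
        # S34 (M step, assumed form): mean over observed and filled values
        theta = (obs.sum() + (n - p) * fill) / n
        # S35: stop when the expected fill value changes by less than eps
        if prev_fill is not None and abs(fill - prev_fill) < eps:
            break
        prev_fill = fill
    # S36: write the predicted value into the missing positions
    filled = x.copy()
    filled[missing] = fill
    return filled, theta
```

Under this assumed M step, a run started from the observed mean is already at its fixed point and stops on the second pass; passing a different `theta0` shows the iteration converging toward the same value.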

Further, step S22 specifically comprises: taking twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and taking the position of each extreme point as an initial cluster center of the K-means algorithm.

Compared with the prior art, the beneficial effect is that the missing data filling method based on an improved K-means clustering algorithm provided by the invention shortens the time needed to fill missing data and effectively improves the filling result.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 compares the classification accuracy of the traditional K-means clustering algorithm with that of the improved K-means clustering algorithm provided by the invention.

Fig. 3 compares the average time taken to fill missing data with the traditional K-means clustering algorithm and with the improved K-means clustering algorithm provided by the invention.

Detailed description of the embodiments

The drawings are for illustration only and shall not be construed as limiting the invention. To better illustrate the embodiment, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The positional relationships depicted in the drawings are for illustration only and shall not be construed as limiting the invention.

Embodiment 1：

As shown in Fig. 1, a missing data filling method based on an improved K-means clustering algorithm comprises the following steps:

Step 1. Use the artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm, specifically comprising:

S11. Initialize the artificial fish swarm algorithm and generate the initial swarm. Set the number of artificial fish N=30, the iteration counter N_c, the maximum number of tries Trynum=10, the maximum number of iterations N_{c_max}=100, the maximum moving step of an artificial fish Step=0.3, the crowding factor δ=10, and the visual range of an artificial fish Visual=0.75. Within the prescribed range, randomly generate 30 artificial fish to form the initial artificial fish swarm.

S12. Initialize the bulletin board. Evaluate the current position of each initial artificial fish and record the best value obtained on the bulletin board.

S13. Behavior selection. Each artificial fish performs the swarming behavior and the following behavior, and the behavior that yields the better value is selected; the default behavior is the foraging behavior. Each time an artificial fish has executed its behaviors, the value computed at its current position is compared with the record on the bulletin board. If its value is better than the bulletin board record, the record is replaced with its current position. The behaviors of an artificial fish further comprise:

S131. Foraging behavior. Suppose the current position of the artificial fish is X_i. Randomly select a new position X_j within its visual range. If the swarm similarity values of the two positions satisfy Y_i < Y_j, move one step in that direction according to formula (1); otherwise, select another position X_j and check whether the condition is met. If the condition is still not met after Trynum tries, the artificial fish moves one step at random according to formula (2);

X_i|next=X_i+Rand()·Step (2)

In the formula, Rand() follows the uniform distribution U(0,1);

S132. Swarming behavior. Suppose the current position of the artificial fish is X_i. Within its visual range (d_ij < Visual), find the number of partners n_f and their center position X_c. If Y_c/n_f > δY_i, this indicates that the neighborhood of X_c holds an optimal solution; the fish then moves one step toward the center X_c according to formula (3); otherwise, it performs the foraging behavior;

S133. Following behavior. Suppose the current state of the artificial fish is X_i. Within its visual range (d_ij < Visual), find the partner X_max whose Y_j is largest. If Y_j/n_f > δY_i, this indicates that partner X_max has a higher swarm similarity and that its surroundings are less crowded, so the fish moves one step toward X_max according to formula (4); otherwise, it performs the foraging behavior;

S14. Determine whether N_c equals N_{c_max}. If so, terminate the algorithm and output the optimal value. If not, return to S13 and set N_c = N_c + 1.
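The parameter values from S11 and the foraging behavior in S131 can be sketched as follows. Formula (1) is not reproduced in the text, so the sketch assumes the form commonly used for artificial fish swarm foraging: a random-length step along the unit vector from X_i toward X_j. The fallback move follows formula (2), applied per coordinate. The `fitness` callable, the helper name `forage`, and the way a candidate is drawn inside the visual range are hypothetical choices made for illustration.

```python
import numpy as np

# Parameter values stated in S11
N_FISH = 30      # number of artificial fish N
TRY_NUM = 10     # maximum number of tries Trynum
MAX_ITER = 100   # maximum number of iterations N_{c_max}
STEP = 0.3       # maximum moving step Step
DELTA = 10       # crowding factor (used by swarming and following, not shown here)
VISUAL = 0.75    # visual range Visual

def forage(x_i, fitness, rng, visual=VISUAL, step=STEP, try_num=TRY_NUM):
    """S131: try up to try_num random positions within the visual range."""
    y_i = fitness(x_i)
    for _ in range(try_num):
        # Candidate X_j: random direction, random radius below `visual` (assumption)
        offset = rng.normal(size=x_i.shape)
        offset *= visual * rng.random() / np.linalg.norm(offset)
        x_j = x_i + offset
        if fitness(x_j) > y_i:
            direction = x_j - x_i
            norm = np.linalg.norm(direction)
            if norm > 0:
                # Assumed formula (1): random-length step toward X_j
                return x_i + rng.random() * step * direction / norm
    # Formula (2): random move when no better position was found
    return x_i + rng.random(x_i.shape) * step
```

A swarm could then be created with `rng = np.random.default_rng(0)` and `rng.uniform(-1, 1, size=(N_FISH, 2))`, calling `forage` on each fish for up to `MAX_ITER` rounds; the swarming and following behaviors of S132 and S133 would be added alongside it in the same style.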

Step 2. Propose an improved K-means clustering algorithm, comprising:

S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm; take twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and take the position of each extreme point as an initial cluster center of the K-means algorithm;

S25. Determine whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result; if it has not, i.e. the new cluster centers are not consistent with the K cluster centers obtained in the previous iteration, repeat steps S23 to S25.

Step 3. Propose an improved expectation-maximization method to fill the missing data in the data set. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling the missing data according to that type, and then fill the missing data in the data set using the expectation-maximization method. This specifically comprises the following steps:

S31. Initialize the distribution parameters;

E(X_fill|X_obs,θ^(k))=θ^(k-1)

In the formula, p denotes the number of data in the observed data set X_obs and n denotes the total number of data; X_i is the position of the current artificial fish, and j is the number of data in the observed data set X_obs plus 1.

|E(X_fill|X_obs,θ^(k)) - E(X_fill|X_obs,θ^(k-1))| < ε

In the formula, ε denotes the convergence parameter;
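Step 3 pairs the clustering with the fill: an incomplete record is first assigned to a cluster, and that cluster serves as the reference data set. A minimal sketch of this idea follows, under several assumptions that the text does not state: missing entries are encoded as NaN, the cluster of an incomplete record is chosen by Euclidean distance over its observed features only, and each missing feature is filled with the mean of that feature over the complete records of the chosen cluster (the starting value θ^(0) of S32). The names `fill_by_cluster`, `complete`, `labels`, and `centers` are illustrative.

```python
import numpy as np

def fill_by_cluster(record, complete, labels, centers):
    """Fill the NaN entries of one record from its nearest cluster."""
    record = np.asarray(record, dtype=float)
    observed = ~np.isnan(record)
    # Nearest center, measured on the observed features only (assumption)
    d = np.linalg.norm(centers[:, observed] - record[observed], axis=1)
    k = int(np.argmin(d))
    # Reference data set: complete records that belong to cluster k
    reference = complete[labels == k]
    filled = record.copy()
    # Fill each missing feature with the reference mean (assumed fill rule)
    filled[~observed] = reference[:, ~observed].mean(axis=0)
    return filled, k
```

Here `complete`, `labels`, and `centers` would come from clustering the complete records beforehand; the sketch assumes the chosen cluster contains at least one complete record.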

After the above steps are carried out and the results computed, the classification accuracy of the traditional K-means clustering algorithm and of the K-means clustering algorithm based on the artificial fish swarm algorithm is compared in Fig. 2, and the average time taken to fill missing data with the traditional K-means clustering algorithm and with the improved K-means clustering algorithm is compared in Fig. 3. The comparisons in Fig. 2 and Fig. 3 show that the invention provides a more effective method of filling missing data.

Obviously, the above embodiment of the present invention is merely an example given to illustrate the invention clearly and does not limit its embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A missing data filling method based on an improved K-means clustering algorithm, characterized by comprising the following steps:

S2. Propose an improved K-means clustering algorithm, comprising:

S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm; the formula of the objective function f(x) is:

In the formula, x denotes a data object, K denotes the number of cluster centers, c_i denotes the i-th cluster center, and dist denotes the Euclidean distance;

S23. Calculate the Euclidean distance from every data object to the K initial cluster centers and, according to these distances, assign each data object to its nearest cluster center;

S25. Determine whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result; if it has not, i.e. the new cluster centers are not consistent with the K cluster centers obtained in the previous iteration, repeat steps S23 to S25;

S3. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling the missing data according to that type, and then fill the missing data in the data set using the expectation-maximization method.

2. The missing data filling method based on an improved K-means clustering algorithm according to claim 1, characterized in that step S3 specifically comprises:

S31. Initialize the distribution parameters;

E(X_fill|X_obs,θ^(k))=θ^(k-1)

In the formula, k denotes the iteration number, X_fill denotes the fill value, E(X_fill|X_obs,θ^(k)) denotes the expected value of the filled data, and θ^(k) denotes the estimated parameter at step k;

In the formula, p denotes the number of data in the observed data set X_obs and n denotes the total number of data; X_i is the position of the current artificial fish, and j is the number of data in the observed data set X_obs plus 1;

S35. Determine whether the convergence condition is met; if so, proceed to step S36; otherwise, return to step S33. The convergence condition is calculated according to the following formula:

|E(X_fill|X_obs,θ^(k)) - E(X_fill|X_obs,θ^(k-1))| < ε

In the formula, ε denotes the convergence parameter;

3. The missing data filling method based on an improved K-means clustering algorithm according to claim 2, characterized in that step S22 specifically comprises: taking twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and taking the position of each extreme point as an initial cluster center of the K-means algorithm.