CN108846434A - Missing data filling method based on an improved K-means clustering algorithm - Google Patents

Missing data filling method based on an improved K-means clustering algorithm

Info

Publication number
CN108846434A
Authority
CN
China
Prior art keywords
data
means clustering
clustering algorithm
missing data
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810597825.4A
Other languages
Chinese (zh)
Inventor
蔡延光
陈东
蔡颢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201810597825.4A
Publication of CN108846434A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/10: Pre-processing; Data cleansing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Abstract

The invention belongs to the field of data processing and, more particularly, relates to a missing data filling method based on an improved K-means clustering algorithm: the data are first classified with the improved K-means clustering algorithm, and the missing data are then filled using the expectation-maximization (EM) method. The specific steps are: S1. an artificial fish swarm algorithm is used to determine the K value of the K-means clustering algorithm; S2. an improved K-means clustering algorithm is proposed; S3. an objective function f(x) is designed for the K-means clustering algorithm; S4. an improved EM method is proposed to fill the missing data in the data set. According to the invention, the proposed method fills missing data with higher speed and precision, and the filling results are good.

Description

Missing data filling method based on an improved K-means clustering algorithm
Technical field
The invention belongs to the field of data processing and, more particularly, relates to a missing data filling method based on an improved K-means clustering algorithm.
Background
Missing data refers to null values or non-conforming values that become mixed into a data set because of human operating errors or equipment faults during data acquisition or transmission. Missing data is relatively common in remote health monitoring systems, where it is caused by non-standard data formats, data transmission problems, and similar factors. Common missing data filling algorithms include multiple imputation, multiple regression imputation, and expectation-maximization (EM) imputation.
Summary of the invention
To overcome at least one of the above drawbacks of the prior art, the invention provides a missing data filling method based on an improved K-means clustering algorithm that effectively improves the results of missing data filling.
To solve the above technical problem, the invention adopts the following technical solution: a missing data filling method based on an improved K-means clustering algorithm, comprising the following steps:
S1. Use an artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm;
S2. Propose an improved K-means clustering algorithm, comprising:
S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm, and determine the formula of f(x):
In the formula, x denotes a data object, K the number of cluster centers, $c_i$ the i-th cluster center, and dist the Euclidean distance;
S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm;
S23. Compute the Euclidean distance from every data object to the K initial cluster centers, and assign each data object to its nearest cluster center;
S24. Recompute the cluster center of each cluster, obtaining K new cluster centers;
S25. Check whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result. If it has not, i.e. the new cluster centers differ from the K cluster centers obtained in the previous iteration, repeat steps S23 to S25;
S3. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling according to that type, and then fill the missing data in the data set using the expectation-maximization (EM) method.
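The formula for f(x) in step S21 is not reproduced in this text. A commonly used K-means objective that is consistent with the symbols defined there (x, K, $c_i$, dist) is the within-cluster sum of squared Euclidean distances. The squared form and the set symbol $C_i$ (the data objects currently assigned to center $c_i$) are assumptions made for illustration, not the patent's confirmed formula:

```latex
f(x) = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^{2}
```

Under this reading, f(x) never increases from one assignment/update pass to the next, so an unchanged value of f(x) between two iterations can serve as the termination test of step S25.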
Further, step S3 specifically comprises:
S31. Initialize the distribution parameters;
S32. Determine the initial value $\theta^{(0)}$ of the EM method; the initial value is the mean of the current observed data set $X_{obs}$;
S33. Compute the expectation step (E step) for the filling data as follows:
$$E(X_{fill} \mid X_{obs}^{(k)}) = \theta^{(k-1)}$$
In the formula, k denotes the iteration number, $X_{fill}$ the filling value, $E(X_{fill} \mid X_{obs}^{(k)})$ the expected value of the filling data, and $\theta^{(k)}$ the estimated parameter at step k;
S34. Compute the maximum likelihood estimate of the parameter, i.e. the maximization step (M step), as follows:
In the formula, p denotes the number of values in the observed data set $X_{obs}$ and n the total number of data values; $X_i$ is the position of the current artificial fish, and j is the number of values in $X_{obs}$ plus 1;
S35. Check whether the convergence condition is met. If so, proceed to step S36; otherwise, return to step S33. The convergence condition is:
$$\left| E(X_{fill} \mid X_{obs}^{(k)}) - E(X_{fill} \mid X_{obs}^{(k-1)}) \right| < \varepsilon$$
In the formula, $\varepsilon$ denotes the convergence parameter;
S36. Output the predicted value $X_{fill}$ and use it to fill the missing data in the data set.
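The M-step formula of step S34 is not reproduced in this text. One update that is consistent with the symbol definitions given (p observed values, n values in total, j = p + 1) averages the observed values together with the current expected fills. This form, and the reading of $X_i$ as the i-th observed value, are assumptions made for illustration only:

```latex
\theta^{(k)} = \frac{1}{n}\left( \sum_{i=1}^{p} X_i \;+\; \sum_{i=j}^{n} E\!\left(X_{fill} \mid X_{obs}^{(k)}\right) \right), \qquad j = p + 1
```

Substituting the E step $E(X_{fill} \mid X_{obs}^{(k)}) = \theta^{(k-1)}$ gives $\theta^{(k)} = \frac{1}{n}\left(\sum_{i=1}^{p} X_i + (n-p)\,\theta^{(k-1)}\right)$, whose fixed point is the mean of the observed values, the same quantity used as the initial value in step S32.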
Further, step S22 specifically comprises: taking twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and taking the position of each extreme point as an initial cluster center of the K-means algorithm.
Compared with the prior art, the beneficial effect is that the missing data filling method based on an improved K-means clustering algorithm provided by the invention shortens the time needed to fill missing data and effectively improves the filling results.
Description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Fig. 2 compares the classification accuracy of the traditional K-means clustering algorithm with that of the improved K-means clustering algorithm provided by the invention.
Fig. 3 compares the average time taken to fill missing data using the traditional K-means clustering algorithm with that using the improved K-means clustering algorithm provided by the invention.
Detailed description of the embodiments
The drawings are for illustration only and shall not be construed as limiting the invention. To better illustrate the embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The positional relationships depicted in the drawings are for illustration only and shall not be construed as limiting the invention.
Embodiment 1:
As shown in Fig. 1, a missing data filling method based on an improved K-means clustering algorithm comprises the following steps:
Step 1. Use an artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm. This specifically comprises:
S11. Initialize the artificial fish swarm algorithm and generate the initial swarm. Set the swarm size N = 30, the iteration counter $N_c$, the maximum number of tries Trynum = 10, the maximum number of iterations $N_{c\_max}$ = 100, the maximum moving step of an artificial fish Step = 0.3, the crowding factor δ = 10, and the visual range of an artificial fish Visual = 0.75. Randomly generate 30 artificial fish within the permitted range to form the initial swarm.
S12. Initialize the bulletin board. Compute the value at the position of each initial artificial fish and record the best value on the bulletin board.
S13. Behavior selection. Each artificial fish performs the swarming behavior and the following behavior, and the behavior that yields the better value is selected; the default behavior is foraging. Each time an artificial fish has executed its behaviors, the value computed at its current position is compared with the record on the bulletin board. If its value is better than the record, its current position replaces the previous record. The behaviors of an artificial fish further comprise:
S131. Foraging behavior. Let the current position of the artificial fish be $X_i$. Randomly choose a new position $X_j$ within its visual range. If the swarm similarities of the two positions satisfy $Y_i < Y_j$, the fish moves one step in that direction according to formula (1); otherwise, another position $X_j$ is chosen and the condition is checked again. If the condition is still not met after Trynum tries, the fish moves one random step according to formula (2):
$$X_{i|next} = X_i + Rand() \cdot Step \qquad (2)$$
In the formula, Rand() follows the uniform distribution U(0, 1);
S132. Swarming behavior. Let the current position of the artificial fish be $X_i$. Within its visual range ($d_{ij} < Visual$), find the number of partners $n_f$ and their center position $X_c$. If $Y_c / n_f > \delta Y_i$, which indicates that the neighborhood of $X_c$ holds a better solution, the fish moves one step toward the center $X_c$ according to formula (3); otherwise, it performs the foraging behavior;
S133. Following behavior. Let the current state of the artificial fish be $X_i$. Within its visual range ($d_{ij} < Visual$), find the partner position $X_{max}$ with the largest $Y_j$. If $Y_j / n_f > \delta Y_i$, which indicates that the partner at $X_{max}$ has a higher swarm similarity and is not too crowded, the fish moves one step toward $X_{max}$ according to formula (4); otherwise, it performs the foraging behavior;
S14. Check whether $N_c$ equals $N_{c\_max}$. If so, terminate the algorithm and output the optimal value. If not, set $N_c = N_c + 1$ and return to S13.
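Steps S11 to S14 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: formulas (1), (3) and (4) are not reproduced in this text, so the sketch assumes the commonly used move of a random fraction of Step along the normalized direction toward the target. The function name `afsa`, the `fitness` callback (standing in for the value Y), the `bounds` argument and the `seed` are hypothetical. The sketch only tracks the best position on the bulletin board; how the number of extreme points (and hence K) is counted is not specified in the text and is not shown.

```python
import math
import random

def afsa(fitness, dim, bounds, n_fish=30, try_num=10, n_iter=100,
         step=0.3, delta=10.0, visual=0.75, seed=0):
    """Maximize `fitness` with a simplified artificial fish swarm search."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(x):
        # Keep every coordinate inside the permitted range.
        return [min(hi, max(lo, v)) for v in x]

    def move_toward(x, target):
        # Assumed form of formulas (1), (3), (4): random fraction of
        # `step` along the unit vector from x to target.
        d = math.dist(x, target)
        if d == 0:
            return list(x)
        r = rng.random() * step
        return clip([xi + r * (ti - xi) / d for xi, ti in zip(x, target)])

    def prey(x):
        # S131: try up to try_num random positions within the visual range.
        for _ in range(try_num):
            cand = clip([xi + visual * rng.uniform(-1, 1) for xi in x])
            if fitness(cand) > fitness(x):
                return move_toward(x, cand)
        # Formula (2): random step with Rand() ~ U(0, 1), drawn per coordinate.
        return clip([xi + rng.random() * step for xi in x])

    def partners(i, fish):
        return [f for j, f in enumerate(fish)
                if j != i and math.dist(fish[i], f) < visual]

    def swarm(i, fish):
        # S132: move toward the partners' center if Y_c / n_f > delta * Y_i.
        nb = partners(i, fish)
        if nb:
            center = [sum(c) / len(nb) for c in zip(*nb)]
            if fitness(center) / len(nb) > delta * fitness(fish[i]):
                return move_toward(fish[i], center)
        return prey(fish[i])

    def follow(i, fish):
        # S133: move toward the best partner if Y_j / n_f > delta * Y_i.
        nb = partners(i, fish)
        if nb:
            best_nb = max(nb, key=fitness)
            if fitness(best_nb) / len(nb) > delta * fitness(fish[i]):
                return move_toward(fish[i], best_nb)
        return prey(fish[i])

    # S11: random initial swarm; S12: bulletin board holds the best fish.
    fish = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_fish)]
    best = max(fish, key=fitness)
    for _ in range(n_iter):                      # S14: fixed iteration budget
        new_fish = []
        for i in range(n_fish):                  # S13: keep the better behavior
            a, b = swarm(i, fish), follow(i, fish)
            cand = a if fitness(a) >= fitness(b) else b
            new_fish.append(cand)
            if fitness(cand) > fitness(best):    # update the bulletin board
                best = list(cand)
        fish = new_fish
    return best, fitness(best)
```

The default arguments mirror the parameter values listed in S11. With δ = 10 and a positive fitness, the crowding tests in `swarm` and `follow` rarely pass, so most moves in this sketch fall back to foraging.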
Step 2. Propose an improved K-means clustering algorithm, comprising:
S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm, and determine the formula of f(x):
In the formula, x denotes a data object, K the number of cluster centers, $c_i$ the i-th cluster center, and dist the Euclidean distance;
S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm: take twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and take the position of each extreme point as an initial cluster center of the K-means algorithm;
S23. Compute the Euclidean distance from every data object to the K initial cluster centers, and assign each data object to its nearest cluster center;
S24. Recompute the cluster center of each cluster, obtaining K new cluster centers;
S25. Check whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result. If it has not, i.e. the new cluster centers differ from the K cluster centers obtained in the previous iteration, repeat steps S23 to S25.
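The loop of steps S22 to S25 can be sketched as below. This is an illustrative sketch under stated assumptions: the formula for f(x) is not reproduced in this text, so the code assumes the standard sum of squared Euclidean distances to the assigned center, and it treats "converged" as "f(x) changed by no more than `tol` since the previous pass". The names `kmeans`, `objective`, `tol` and `max_iter` are hypothetical, and the initial centers are simply passed in rather than derived from the fish swarm.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def objective(points, centers, labels):
    """Assumed f(x): sum of squared distances to the assigned center."""
    return sum(dist(p, centers[k]) ** 2 for p, k in zip(points, labels))

def kmeans(points, init_centers, tol=1e-9, max_iter=100):
    centers = [list(c) for c in init_centers]    # S22: given initial centers
    prev_f = float("inf")
    labels, f = [], float("inf")
    for _ in range(max_iter):
        # S23: assign every data object to its nearest cluster center.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # S24: recompute each center as the mean of its members
        # (an empty cluster keeps its previous center).
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        # S25: stop once the objective no longer changes.
        f = objective(points, centers, labels)
        if abs(prev_f - f) <= tol:
            break
        prev_f = f
    return centers, labels, f
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], [(0.0, 0.0), (10.0, 10.0)])` groups the first two and the last two points and stops on the second pass, when f(x) is unchanged.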
Step 3. Propose an improved expectation-maximization (EM) method to fill the missing data in the data set. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling according to that type, and then fill the missing data in the data set using the EM method. This specifically comprises the following steps:
S31. Initialize the distribution parameters;
S32. Determine the initial value $\theta^{(0)}$ of the EM method; the initial value is the mean of the current observed data set $X_{obs}$;
S33. Compute the expectation step (E step) for the filling data as follows:
$$E(X_{fill} \mid X_{obs}^{(k)}) = \theta^{(k-1)}$$
In the formula, k denotes the iteration number, $X_{fill}$ the filling value, $E(X_{fill} \mid X_{obs}^{(k)})$ the expected value of the filling data, and $\theta^{(k)}$ the estimated parameter at step k;
S34. Compute the maximum likelihood estimate of the parameter, i.e. the maximization step (M step), as follows:
In the formula, p denotes the number of values in the observed data set $X_{obs}$ and n the total number of data values; $X_i$ is the position of the current artificial fish, and j is the number of values in $X_{obs}$ plus 1.
S35. Check whether the convergence condition is met. If so, proceed to step S36; otherwise, return to step S33. The convergence condition is:
$$\left| E(X_{fill} \mid X_{obs}^{(k)}) - E(X_{fill} \mid X_{obs}^{(k-1)}) \right| < \varepsilon$$
In the formula, $\varepsilon$ denotes the convergence parameter;
S36. Output the predicted value $X_{fill}$ and use it to fill the missing data in the data set.
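Steps S31 to S36 can be sketched for a single numeric attribute as follows. The M-step formula is not reproduced in this text, so the update used here (the mean of the observed values together with the current expected fills) is an assumption, as are the function name `em_fill`, the use of `None` to mark missing entries, and the optional `theta0` argument, which exists only to make the iteration visible.

```python
def em_fill(values, eps=1e-6, max_iter=100, theta0=None):
    """Fill None entries of `values` with an iteratively estimated mean."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    n, p = len(values), len(observed)
    total = sum(observed)
    # S32: the initial value is the mean of the observed data set
    # (unless a different start is supplied for demonstration).
    theta = total / p if theta0 is None else theta0
    prev_e = None
    e = theta
    for _ in range(max_iter):
        e = theta                                 # S33, E step: E = theta^(k-1)
        theta = (total + (n - p) * e) / n         # S34, M step (assumed form)
        if prev_e is not None and abs(e - prev_e) < eps:
            break                                 # S35: convergence reached
        prev_e = e
    filled = [e if v is None else v for v in values]   # S36: fill missing data
    return filled, e
```

With the initial value from S32 the iteration starts at its fixed point, so `em_fill([1.0, None, 3.0])` returns `[1.0, 2.0, 3.0]` after the second pass. In the method described above, `values` would be one attribute of the reference data set, i.e. the cluster to which the incomplete record was assigned.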
After the above steps were carried out, the classification accuracy of the traditional K-means clustering algorithm and of the K-means clustering algorithm based on the artificial fish swarm algorithm was computed and is compared in Fig. 2; the average time taken to fill missing data using the traditional K-means clustering algorithm and using the improved K-means clustering algorithm is compared in Fig. 3. The comparisons in Fig. 2 and Fig. 3 show that the invention provides a more effective method for filling missing data.
Obviously, the above embodiment is merely an example given to clearly illustrate the invention and does not limit the embodiments of the invention. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.

Claims (3)

1. A missing data filling method based on an improved K-means clustering algorithm, characterized by comprising the following steps:
S1. Use an artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm;
S2. Propose an improved K-means clustering algorithm, comprising:
S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm, and determine the formula of f(x):
In the formula, x denotes a data object, K the number of cluster centers, $c_i$ the i-th cluster center, and dist the Euclidean distance;
S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm;
S23. Compute the Euclidean distance from every data object to the K initial cluster centers, and assign each data object to its nearest cluster center;
S24. Recompute the cluster center of each cluster, obtaining K new cluster centers;
S25. Check whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result. If it has not, i.e. the new cluster centers differ from the K cluster centers obtained in the previous iteration, repeat steps S23 to S25;
S3. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling according to that type, and then fill the missing data in the data set using the expectation-maximization (EM) method.
2. The missing data filling method based on an improved K-means clustering algorithm according to claim 1, characterized in that step S3 specifically comprises:
S31. Initialize the distribution parameters;
S32. Determine the initial value $\theta^{(0)}$ of the EM method; the initial value is the mean of the current observed data set $X_{obs}$;
S33. Compute the expectation step (E step) for the filling data as follows:
$$E(X_{fill} \mid X_{obs}^{(k)}) = \theta^{(k-1)}$$
In the formula, k denotes the iteration number, $X_{fill}$ the filling value, $E(X_{fill} \mid X_{obs}^{(k)})$ the expected value of the filling data, and $\theta^{(k)}$ the estimated parameter at step k;
S34. Compute the maximum likelihood estimate of the parameter, i.e. the maximization step (M step), as follows:
In the formula, p denotes the number of values in the observed data set $X_{obs}$ and n the total number of data values; $X_i$ is the position of the current artificial fish, and j is the number of values in $X_{obs}$ plus 1;
S35. Check whether the convergence condition is met. If so, proceed to step S36; otherwise, return to step S33. The convergence condition is:
$$\left| E(X_{fill} \mid X_{obs}^{(k)}) - E(X_{fill} \mid X_{obs}^{(k-1)}) \right| < \varepsilon$$
In the formula, $\varepsilon$ denotes the convergence parameter;
S36. Output the predicted value $X_{fill}$ and use it to fill the missing data in the data set.
3. The missing data filling method based on an improved K-means clustering algorithm according to claim 2, characterized in that step S22 specifically comprises: taking twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and taking the position of each extreme point as an initial cluster center of the K-means algorithm.
CN201810597825.4A 2018-06-11 2018-06-11 Missing data filling method based on an improved K-means clustering algorithm Pending CN108846434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810597825.4A CN108846434A (en) 2018-06-11 2018-06-11 Missing data filling method based on an improved K-means clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810597825.4A CN108846434A (en) 2018-06-11 2018-06-11 Missing data filling method based on an improved K-means clustering algorithm

Publications (1)

Publication Number Publication Date
CN108846434A CN108846434A (en) 2018-11-20

Family

ID=64211659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810597825.4A Pending CN108846434A (en) 2018-06-11 2018-06-11 Missing data filling method based on an improved K-means clustering algorithm

Country Status (1)

Country Link
CN (1) CN108846434A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275895A (en) * 2019-06-25 2019-09-24 广东工业大学 Missing traffic data filling equipment, device and method
CN110659268A (en) * 2019-08-15 2020-01-07 中国平安财产保险股份有限公司 Data filling method and device based on clustering algorithm and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301996A (en) * 2014-04-15 2015-01-21 河南科技大学 Wireless sensor network positioning method
CN106127262A (en) * 2016-06-29 2016-11-16 海南大学 Clustering method for data sets with missing attributes
CN106407258A (en) * 2016-08-24 2017-02-15 广东工业大学 Missing data prediction method and apparatus
CN107291765A (en) * 2016-04-05 2017-10-24 南京航空航天大学 Clustering method for handling missing data based on DC programming

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301996A (en) * 2014-04-15 2015-01-21 河南科技大学 Wireless sensor network positioning method
CN107291765A (en) * 2016-04-05 2017-10-24 南京航空航天大学 Clustering method for handling missing data based on DC programming
CN106127262A (en) * 2016-06-29 2016-11-16 海南大学 Clustering method for data sets with missing attributes
CN106407258A (en) * 2016-08-24 2017-02-15 广东工业大学 Missing data prediction method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
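孙淑敏 et al.: "Key frame extraction based on an improved K-means algorithm" (基于改进K-means算法的关键帧提取), Computer Engineering (《计算机工程》) *
梁秉毅 et al.: "Missing data filling algorithm based on an optimized decision tree and EM" (基于优化决策树和EM的缺失数据填充算法), Automation & Information Engineering (《自动化与信息工程》) *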

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181120