CN108846434A - Missing data filling method based on an improved K-means clustering algorithm
- Publication number
- CN108846434A (application number CN201810597825.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- means clustering
- clustering algorithm
- missing data
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Abstract
The invention belongs to the field of data processing and relates to a missing data filling method based on an improved K-means clustering algorithm: the data are first classified with the improved K-means clustering algorithm, and the missing data are then filled using the expectation-maximization (EM) method. The specific steps are: S1. an artificial fish swarm algorithm is proposed to determine the K value of the K-means clustering algorithm; S2. an improved K-means clustering algorithm is proposed; S3. an objective function f(x) of the K-means clustering algorithm is designed; S4. an improved expectation-maximization method is proposed to fill the missing data in the data set. According to the invention, the proposed method fills missing data with higher speed and precision and gives a good filling result.
Description
Technical field
The invention belongs to the field of data processing and, more particularly, relates to a missing data filling method based on an improved K-means clustering algorithm.
Background art
Missing data are null values or invalid values that become mixed into a data set because of human operating errors or machine faults during data acquisition or transmission. Missing data are relatively common in remote health monitoring systems, where non-standard formats, data transmission problems, and similar causes lead to data loss. Common missing data filling algorithms include multiple imputation, multiple regression imputation, and expectation-maximization (EM) imputation.
Summary of the invention
To overcome at least one of the above defects of the prior art, the present invention provides a missing data filling method based on an improved K-means clustering algorithm, which effectively improves the result of filling missing data.
To solve the above technical problem, the technical solution adopted by the invention is a missing data filling method based on an improved K-means clustering algorithm, comprising the following steps:
S1. Use an artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm;
S2. Propose an improved K-means clustering algorithm, comprising:
S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm, the formula of f(x) being:
(formula not reproduced in this text)
where x denotes a data object, K denotes the number of cluster centers, c_i denotes the i-th cluster center, and dist denotes the Euclidean distance;
S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm;
S23. Compute the Euclidean distance from every data object to each of the K initial cluster centers, and assign each data object to its nearest cluster center;
S24. Recompute the cluster center of each cluster, obtaining K new cluster centers;
S25. Judge whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result; if it has not, that is, the new cluster centers differ from the K cluster centers obtained in the previous iteration, repeat steps S23 to S25;
S3. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling the missing data according to that type, and then fill the missing data in the data set using the expectation-maximization (EM) method.
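The formula for f(x) in step S21 does not survive in this text. A form consistent with the symbol definitions given there (data object x, K cluster centers c_i, Euclidean distance dist) is the standard K-means sum of squared errors. It is shown here as an assumption, not as the patent's exact expression; the cluster sets C_i are notation introduced for this sketch.

```latex
f(x) = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2,
\qquad
\operatorname{dist}(c_i, x) = \lVert x - c_i \rVert_2
```

Under this reading, f(x) cannot increase from one iteration to the next, so the test in S25 can be implemented as a check that the cluster centers (or the value of f) have stopped changing.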
Further, step S3 specifically comprises:
S31. Initialize the distribution parameters;
S32. Determine the initial value θ^(0) of the expectation-maximization method, the initial value being the mean of the currently observed data set X_obs;
S33. Compute the expectation step for the filled data, i.e. the E step, as follows:
E(X_fill | X_obs, θ^(k)) = θ^(k-1)
where k denotes the iteration number, X_fill denotes the filled value, E(X_fill | X_obs, θ^(k)) denotes the expected value of the filled data, and θ^(k) denotes the parameter estimate at step k;
S34. Compute the maximum likelihood estimate of the parameter from the expected value, i.e. the maximization step (M step), as follows:
(formula not reproduced in this text)
where p denotes the number of items in the observed data set X_obs, n denotes the total number of data items, X_i is the position of the current artificial fish, and j is the number of items in X_obs plus 1;
S35. Judge whether the convergence condition is met. If it is, proceed to step S36; otherwise return to step S33. The convergence condition is:
|E(X_fill | X_obs, θ^(k)) - E(X_fill | X_obs, θ^(k-1))| < ε
where ε denotes the convergence threshold;
S36. Output the predicted value X_fill and use it to fill the missing data in the data set.
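The M-step formula in S34 is missing from this text. One update consistent with the definitions of p, n, and j is to average the p observed values together with the n - p filled values, each filled value being the E-step expectation θ^(k-1). This is an assumption made for illustration, with x_i standing for the i-th observed value.

```latex
\theta^{(k)} = \frac{1}{n}\left( \sum_{i=1}^{p} x_i + \sum_{j=p+1}^{n} \theta^{(k-1)} \right)
             = \frac{1}{n}\left( \sum_{i=1}^{p} x_i + (n-p)\,\theta^{(k-1)} \right)
```

Under this assumed form, writing μ for the mean of X_obs gives θ^(k) - μ = ((n - p)/n)(θ^(k-1) - μ). Starting from θ^(0) = μ, as S32 prescribes, therefore yields θ^(1) = θ^(0), and the convergence test in S35 is met on the first pass; any other starting value would approach μ geometrically with ratio (n - p)/n.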
Further, step S22 specifically comprises: taking twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and taking the position of each extreme point as an initial cluster center of the K-means algorithm.
Compared with the prior art, the beneficial effect is that the missing data filling method based on an improved K-means clustering algorithm provided by the invention shortens the time needed to fill missing data and effectively improves the filling result.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Fig. 2 compares the classification accuracy of the traditional K-means clustering algorithm with that of the improved K-means clustering algorithm provided by the invention.
Fig. 3 compares the average time taken to fill missing data using the traditional K-means clustering algorithm with that using the improved K-means clustering algorithm provided by the invention.
Specific embodiments
The drawings are for illustration only and should not be construed as limiting the invention. To better illustrate this embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The positional relationships described in the drawings are for illustration only and should not be construed as limiting the invention.
Embodiment 1:
As shown in Fig. 1, a missing data filling method based on an improved K-means clustering algorithm comprises the following steps:
Step 1. Use the artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm. Specifically:
S11. Initialize the artificial fish swarm algorithm and generate the initial swarm. Set the swarm size N = 30, the iteration counter N_c, the maximum number of tries Trynum = 10, the maximum number of iterations N_c_max = 100, the maximum step length of an artificial fish Step = 0.3, the crowding factor δ = 10, and the visual range of an artificial fish Visual = 0.75. Randomly generate 30 artificial fish within the specified range to form the initial swarm.
S12. Initialize the bulletin board. Compute the value at the current position of each initial artificial fish, and record the best value found on the bulletin board.
S13. Select behaviors. Each artificial fish performs the swarming behavior and the following (rear-end) behavior, and the behavior that yields the better value is selected; the default behavior is foraging. Each time an artificial fish has executed its behaviors, the value computed at its current position is compared with the bulletin board record. If its value is better than the record, its current position replaces the previous record on the bulletin board. The behaviors of the artificial fish further comprise:
S131. Foraging behavior. Let the position of the current artificial fish be X_i. Randomly choose a new position X_j within its visual range. If the food concentrations at the two positions satisfy Y_i < Y_j, move one step in that direction according to formula (1); otherwise choose another position X_j and check the condition again. If the condition is still not met after Trynum tries, the artificial fish moves one random step according to formula (2):
X_i|next = X_i + Rand()·Step (2)
where Rand() follows the uniform distribution U(0, 1);
S132. Swarming behavior. Let the position of the current artificial fish be X_i. Within its visual range (d_ij < Visual), find the number of partners n_f and their center position X_c. If Y_c/n_f > δ·Y_i, the area around X_c holds an optimal solution; the fish then moves one step toward the center X_c according to formula (3). Otherwise it performs the foraging behavior;
S133. Following behavior. Let the state of the current artificial fish be X_i. Within its visual range (d_ij < Visual), find the partner position X_max whose value Y_j is the largest. If Y_j/n_f > δ·Y_i, the partner X_max has a higher food concentration and is less crowded; the fish then moves one step toward X_max according to formula (4). Otherwise it performs the foraging behavior;
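Formulas (1), (3), and (4) referenced in S131 to S133 are not reproduced in this text. In the artificial fish swarm literature the three moves usually share one form, a random-length step along the unit vector toward the target position. The expressions below follow that common convention and are an assumption about what the source intends.

```latex
\begin{aligned}
X_{i|next} &= X_i + \mathrm{Rand}() \cdot Step \cdot \frac{X_j - X_i}{\lVert X_j - X_i \rVert} && (1)\\
X_{i|next} &= X_i + \mathrm{Rand}() \cdot Step \cdot \frac{X_c - X_i}{\lVert X_c - X_i \rVert} && (3)\\
X_{i|next} &= X_i + \mathrm{Rand}() \cdot Step \cdot \frac{X_{max} - X_i}{\lVert X_{max} - X_i \rVert} && (4)
\end{aligned}
```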
S14. Judge whether N_c equals N_c_max. If so, terminate the algorithm and output the optimal value. If not, return to S13 and increment N_c by 1.
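Steps S11 to S14 can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not the patent's implementation: the parameter values come from S11, while the two-peak `objective` function, the search interval [0, 4], the seed, and the `move_toward` helper (a random-length step toward the target, standing in for formulas (1), (3), and (4)) are all assumptions introduced here.

```python
import math
import random

# Parameter values from S11; everything else in this sketch is hypothetical.
N, TRY_NUM, NC_MAX = 30, 10, 100
STEP, DELTA, VISUAL = 0.3, 10.0, 0.75


def objective(x):
    """Hypothetical food concentration Y: a two-peak function."""
    return math.exp(-(x - 1.0) ** 2) + 0.8 * math.exp(-(x - 3.0) ** 2)


def move_toward(x, target, rng):
    """Assumed form of formulas (1), (3), (4): random step toward target."""
    if target == x:
        return x
    return x + rng.random() * STEP * math.copysign(1.0, target - x)


def forage(x, rng):
    """S131: try up to TRY_NUM positions in view, else take a random step."""
    for _ in range(TRY_NUM):
        xj = x + rng.uniform(-VISUAL, VISUAL)
        if objective(xj) > objective(x):
            return move_toward(x, xj, rng)
    return x + rng.random() * STEP  # formula (2) as written in the source


def swarm(x, fish, rng):
    """S132: move toward the centre of visible partners if the test passes."""
    partners = [p for p in fish if p != x and abs(p - x) < VISUAL]
    if partners:
        xc = sum(partners) / len(partners)
        if objective(xc) / len(partners) > DELTA * objective(x):
            return move_toward(x, xc, rng)
    return forage(x, rng)


def follow(x, fish, rng):
    """S133: move toward the best visible partner if the test passes."""
    partners = [p for p in fish if p != x and abs(p - x) < VISUAL]
    if partners:
        xmax = max(partners, key=objective)
        if objective(xmax) / len(partners) > DELTA * objective(x):
            return move_toward(x, xmax, rng)
    return forage(x, rng)


def afsa(seed=0, lo=0.0, hi=4.0):
    rng = random.Random(seed)
    fish = [rng.uniform(lo, hi) for _ in range(N)]   # S11: initial swarm
    best = max(fish, key=objective)                  # S12: bulletin board
    for _ in range(NC_MAX):                          # S14: iteration limit
        new_fish = []
        for x in fish:                               # S13: pick the better move
            cand = max(swarm(x, fish, rng), follow(x, fish, rng), key=objective)
            cand = min(max(cand, lo), hi)            # keep the fish in range
            new_fish.append(cand)
            if objective(cand) > objective(best):
                best = cand                          # update the bulletin board
        fish = new_fish
    return best, fish
```

Calling `afsa()` returns the bulletin-board position and the final swarm. With δ = 10 as stated in S11, the swarming and following tests are rarely satisfied for an objective of this scale, so most moves in this sketch fall back to foraging.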
Step 2. Propose an improved K-means clustering algorithm, comprising:
S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm, the formula of f(x) being:
(formula not reproduced in this text)
where x denotes a data object, K denotes the number of cluster centers, c_i denotes the i-th cluster center, and dist denotes the Euclidean distance;
S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm: twice the number of extreme points found by the artificial fish swarm algorithm is taken as the cluster number K of the K-means algorithm, and the position of each extreme point is taken as an initial cluster center of the K-means algorithm;
S23. Compute the Euclidean distance from every data object to each of the K initial cluster centers, and assign each data object to its nearest cluster center;
S24. Recompute the cluster center of each cluster, obtaining K new cluster centers;
S25. Judge whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result; if it has not, that is, the new cluster centers differ from the K cluster centers obtained in the previous iteration, repeat steps S23 to S25.
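The loop in S23 to S25 can be sketched as follows. This is a minimal pure-Python illustration that assumes the objective f is the sum of squared Euclidean distances from each data object to its assigned center (the source's formula is not available); the function names, the tolerance `tol`, and the iteration cap `max_iter` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(points, centers, tol=1e-9, max_iter=100):
    """Loop over S23 to S25; `centers` are the initial centres from S22."""
    prev_f = float("inf")
    for _ in range(max_iter):
        # S23: assign every point to its nearest centre.
        labels = [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
                  for p in points]
        # S24: recompute each centre as the mean of its cluster.
        new_centers = []
        for i, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                c = tuple(sum(col) / len(members) for col in zip(*members))
            new_centers.append(c)  # an empty cluster keeps its old centre
        centers = new_centers
        # Assumed objective: sum of squared distances to the assigned centre.
        f = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        # S25: stop once f no longer changes.
        if abs(prev_f - f) < tol:
            break
        prev_f = f
    return centers, labels, f


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, f = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, labels, f)
```

In the method described here, the initial centers passed to `kmeans` would come from the extreme points found by the artificial fish swarm algorithm rather than being chosen by hand as in this example.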
Step 3. Propose an improved expectation-maximization (EM) method to fill the missing data in the data set. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling the missing data according to that type, and then fill the missing data in the data set using the expectation-maximization method. Specifically:
S31. Initialize the distribution parameters;
S32. Determine the initial value θ^(0) of the expectation-maximization method, the initial value being the mean of the currently observed data set X_obs;
S33. Compute the expectation step for the filled data, i.e. the E step, as follows:
E(X_fill | X_obs, θ^(k)) = θ^(k-1)
where k denotes the iteration number, X_fill denotes the filled value, E(X_fill | X_obs, θ^(k)) denotes the expected value of the filled data, and θ^(k) denotes the parameter estimate at step k;
S34. Compute the maximum likelihood estimate of the parameter from the expected value, i.e. the maximization step (M step), as follows:
(formula not reproduced in this text)
where p denotes the number of items in the observed data set X_obs, n denotes the total number of data items, X_i is the position of the current artificial fish, and j is the number of items in X_obs plus 1.
S35. Judge whether the convergence condition is met. If it is, proceed to step S36; otherwise return to step S33. The convergence condition is:
|E(X_fill | X_obs, θ^(k)) - E(X_fill | X_obs, θ^(k-1))| < ε
where ε denotes the convergence threshold;
S36. Output the predicted value X_fill and use it to fill the missing data in the data set.
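Steps S32 to S36 can be sketched for a single numeric attribute as follows. This is a hedged illustration rather than the patent's procedure: the M step is assumed to be the mean over the observed values and the filled values (the source's formula is not available), `values` stands for one attribute of the reference data set chosen by the clustering step, and the names `em_fill`, `eps`, and `max_iter` are hypothetical.

```python
def em_fill(values, eps=1e-6, max_iter=100):
    """Fill None entries of one attribute with an EM-style mean estimate."""
    observed = [v for v in values if v is not None]
    n, p = len(values), len(observed)
    if p == 0:
        raise ValueError("no observed values to estimate from")
    theta = sum(observed) / p                  # S32: theta^(0) = mean of X_obs
    for _ in range(max_iter):
        expected_fill = theta                  # S33 (E step): E(X_fill) = theta^(k-1)
        # S34 (M step, assumed form): mean over observed and filled values.
        new_theta = (sum(observed) + (n - p) * expected_fill) / n
        converged = abs(new_theta - theta) < eps   # S35: convergence test
        theta = new_theta
        if converged:
            break
    return [theta if v is None else v for v in values]   # S36: fill the gaps


print(em_fill([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

Because the assumed update leaves the observed mean unchanged, this loop exits after its first pass when started from the mean of X_obs; a richer model (for example one that also estimates a variance or uses several attributes) would need more iterations.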
After the above steps are carried out, the computed classification accuracy of the traditional K-means clustering algorithm and of the K-means clustering algorithm based on the artificial fish swarm algorithm is compared in Fig. 2, and the average time taken to fill missing data with the traditional and with the improved K-means clustering algorithm is compared in Fig. 3. The comparisons in Fig. 2 and Fig. 3 show that the invention provides a more effective missing data filling method.
Obviously, the above embodiment is merely an example given to illustrate the invention clearly and does not limit its embodiments. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.
Claims (3)
1. A missing data filling method based on an improved K-means clustering algorithm, characterized in that it comprises the following steps:
S1. Use an artificial fish swarm algorithm to determine the K value of the K-means clustering algorithm;
S2. Propose an improved K-means clustering algorithm, comprising:
S21. Propose an objective function f(x) as the termination condition of the K-means clustering algorithm, the formula of f(x) being:
(formula not reproduced in this text)
where x denotes a data object, K denotes the number of cluster centers, c_i denotes the i-th cluster center, and dist denotes the Euclidean distance;
S22. Select K data objects from the data set as the initial cluster centers of the K-means clustering algorithm;
S23. Compute the Euclidean distance from every data object to each of the K initial cluster centers, and assign each data object to its nearest cluster center;
S24. Recompute the cluster center of each cluster, obtaining K new cluster centers;
S25. Judge whether the objective function f(x) has converged. If it has, terminate the algorithm and output the clustering result; if it has not, that is, the new cluster centers differ from the K cluster centers obtained in the previous iteration, repeat steps S23 to S25;
S3. Use the improved K-means clustering algorithm to classify the missing data and determine its type, determine the reference data set for filling the missing data according to that type, and then fill the missing data in the data set using the expectation-maximization (EM) method.
2. The missing data filling method based on an improved K-means clustering algorithm according to claim 1, characterized in that step S3 specifically comprises:
S31. Initialize the distribution parameters;
S32. Determine the initial value θ^(0) of the expectation-maximization method, the initial value being the mean of the currently observed data set X_obs;
S33. Compute the expectation step for the filled data, i.e. the E step, as follows:
E(X_fill | X_obs, θ^(k)) = θ^(k-1)
where k denotes the iteration number, X_fill denotes the filled value, E(X_fill | X_obs, θ^(k)) denotes the expected value of the filled data, and θ^(k) denotes the parameter estimate at step k;
S34. Compute the maximum likelihood estimate of the parameter from the expected value, i.e. the maximization step (M step), as follows:
(formula not reproduced in this text)
where p denotes the number of items in the observed data set X_obs, n denotes the total number of data items, X_i is the position of the current artificial fish, and j is the number of items in X_obs plus 1;
S35. Judge whether the convergence condition is met. If it is, proceed to step S36; otherwise return to step S33. The convergence condition is:
|E(X_fill | X_obs, θ^(k)) - E(X_fill | X_obs, θ^(k-1))| < ε
where ε denotes the convergence threshold;
S36. Output the predicted value X_fill and use it to fill the missing data in the data set.
3. The missing data filling method based on an improved K-means clustering algorithm according to claim 2, characterized in that step S22 specifically comprises: taking twice the number of extreme points found by the artificial fish swarm algorithm as the cluster number K of the K-means algorithm, and taking the position of each extreme point as an initial cluster center of the K-means algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810597825.4A CN108846434A (en) | 2018-06-11 | 2018-06-11 | A kind of missing data fill method based on improvement K-means clustering algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108846434A true CN108846434A (en) | 2018-11-20 |
Family
ID=64211659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810597825.4A Pending CN108846434A (en) | 2018-06-11 | 2018-06-11 | A kind of missing data fill method based on improvement K-means clustering algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846434A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110275895A (en) * | 2019-06-25 | 2019-09-24 | 广东工业大学 | It is a kind of to lack the filling equipment of traffic data, device and method |
CN110659268A (en) * | 2019-08-15 | 2020-01-07 | 中国平安财产保险股份有限公司 | Data filling method and device based on clustering algorithm and computer equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104301996A (en) * | 2014-04-15 | 2015-01-21 | 河南科技大学 | Wireless sensor network positioning method |
CN106127262A (en) * | 2016-06-29 | 2016-11-16 | 海南大学 | The clustering method of one attribute missing data collection |
CN106407258A (en) * | 2016-08-24 | 2017-02-15 | 广东工业大学 | Missing data prediction method and apparatus |
CN107291765A (en) * | 2016-04-05 | 2017-10-24 | 南京航空航天大学 | The clustering method of processing missing data is planned based on DC |
Non-Patent Citations (2)
Title |
---|
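Sun Shumin et al.: "Key frame extraction based on an improved K-means algorithm" (基于改进K-means算法的关键帧提取), 《计算机工程》 (Computer Engineering) * |
Liang Bingyi et al.: "Missing data filling algorithm based on an optimized decision tree and EM" (基于优化决策树和EM的缺失数据填充算法), 《自动化与信息工程》 (Automation & Information Engineering) * |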
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20181120 |